Verifying data integrity of influenza genome database

Data integrity is crucial in big data work in any field. This project, set in bioinformatics, aimed to verify the data integrity of the integrated database of the Nanyang Technological University Biomedical Informatics Lab (NTU BIL). In particular, this project dealt with integ...

Bibliographic Details
Main Author: Ong, Tse Yin
Other Authors: Kwoh Chee Keong
Format: Final Year Project
Language: English
Published: 2018
Subjects: DRNTU::Engineering::Computer science and engineering; DRNTU::Library and information science
Online Access: http://hdl.handle.net/10356/74063
Institution: Nanyang Technological University
id sg-ntu-dr.10356-74063
record_format dspace
spelling sg-ntu-dr.10356-74063 2023-03-03T20:37:53Z Verifying data integrity of influenza genome database Ong, Tse Yin Kwoh Chee Keong School of Computer Science and Engineering Bioinformatics Research Centre DRNTU::Engineering::Computer science and engineering DRNTU::Library and information science Bachelor of Engineering (Computer Science) 2018-04-24T04:51:08Z 2018-04-24T04:51:08Z 2018 Final Year Project (FYP) http://hdl.handle.net/10356/74063 en Nanyang Technological University 45 p. application/pdf
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
topic DRNTU::Engineering::Computer science and engineering
DRNTU::Library and information science
description Data integrity is crucial in big data work in any field. This project, set in bioinformatics, aimed to verify the data integrity of the integrated database of the Nanyang Technological University Biomedical Informatics Lab (NTU BIL). In particular, it dealt with verifying the integrity of influenza genome sequence data in the integrated database, sourced from two major public databases: the National Center for Biotechnology Information (NCBI) and the Global Initiative on Sharing All Influenza Data (GISAID). To verify the integrity of the data, the data integration processes for the NCBI and GISAID databases first had to be examined closely. Data sourced from GISAID had to be converted to NCBI format before integration could be carried out. Once the databases were confirmed to be functional, data integrity could be verified. To verify sequence data integrity, pairwise alignment was performed at the amino acid level to determine the similarity of sequences in NTU BIL’s integrated database to reference protein sequences. Each sequence record was assigned a similarity score, and low-similarity records were written out for further investigation. The results showed that the data integrity of the NCBI database was preserved: there were no problematic data records and no twilight-zone proteins. The code was also optimized and made more readable for easy reference. For the GISAID database, an existing bug that could not be resolved by the end of the project prevented data integrity verification from being carried out. Recommendations for future work include (1) achieving functionality and data integrity of the GISAID database, (2) updating the current code to use the latest versions of its libraries, and (3) adopting a different approach to calculating the threshold for a low similarity score. These steps would make the databases easier to maintain and increase the accuracy of the data integrity verification.
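The similarity check described above can be illustrated with a minimal sketch. It is not the project's code: the record does not state which alignment library, scoring matrix, or threshold was used. The sketch assumes a plain Needleman-Wunsch global alignment with hypothetical match, mismatch, and gap scores (a real pipeline would more likely use a substitution matrix), defines similarity as the fraction of identical aligned positions, and uses an illustrative cut-off of 0.35, near the upper edge of what is commonly described as the protein twilight zone (roughly 20 to 35 percent identity). All function and constant names are invented for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical scoring scheme; the project's actual scores are not stated.
MATCH, MISMATCH, GAP = 1, -1, -2
# Hypothetical cut-off for "low similarity" (illustrative only).
LOW_SIMILARITY_THRESHOLD = 0.35


def _pair_score(x: str, y: str) -> int:
    """Score for aligning residue x against residue y."""
    return MATCH if x == y else MISMATCH


def global_align(a: str, b: str) -> Tuple[str, str]:
    """Needleman-Wunsch global alignment of two amino acid sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * GAP
    for j in range(1, cols):
        score[0][j] = j * GAP
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + _pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + GAP,  # gap in b
                score[i][j - 1] + GAP,  # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a: List[str] = []
    out_b: List[str] = []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + _pair_score(a[i - 1], b[j - 1])):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def similarity(a: str, b: str) -> float:
    """Fraction of alignment columns that hold identical residues."""
    if not a and not b:
        return 1.0
    aligned_a, aligned_b = global_align(a, b)
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y)
    return matches / len(aligned_a)


def flag_low_similarity(
    records: Dict[str, str],
    reference: str,
    threshold: float = LOW_SIMILARITY_THRESHOLD,
) -> List[Tuple[str, float]]:
    """Return (record id, score) for records scoring below the threshold."""
    flagged = []
    for record_id, sequence in records.items():
        score = similarity(sequence, reference)
        if score < threshold:
            flagged.append((record_id, score))
    return flagged
```

Calling `flag_low_similarity` with a dictionary of record identifiers and protein sequences plus one reference sequence returns the records that would be written out for further investigation. Making the threshold a parameter reflects recommendation (3): the cut-off is a design choice that can be revised without touching the alignment itself.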