Verifying data integrity of influenza genome database

Bibliographic Details
Main Author: Ong, Tse Yin
Other Authors: Kwoh Chee Keong
Format: Final Year Project
Language: English
Published: 2018
Subjects:
Online Access: http://hdl.handle.net/10356/74063
Institution: Nanyang Technological University
Description
Summary: Data integrity is crucial in big data work in any field. This project, set in bioinformatics, aimed to verify the data integrity of the integrated database maintained by the Nanyang Technological University Biomedical Informatics Lab (NTU BIL). Specifically, it dealt with integrity verification of influenza genome sequence data in the integrated database, which is sourced from two major public databases: the National Center for Biotechnology Information (NCBI) and the Global Initiative on Sharing All Influenza Data (GISAID). To verify the integrity of the data, the data integration processes for the NCBI and GISAID databases first had to be examined closely. Data sourced from GISAID had to be converted to NCBI format before integration could be carried out. Once the databases were confirmed to be functional, data integrity could be verified. To verify sequence data integrity, pairwise alignment was performed at the amino acid level to determine the similarity of sequences in NTU BIL’s integrated database to reference protein sequence data. Each sequence record was then assigned a similarity score, and records with low similarity were written out for further investigation. The results showed that the data integrity of the NCBI database was preserved: no problematic data records or twilight-zone proteins were found. The code was also optimized and made more readable for ease of reference. For the GISAID database, an existing bug that could not be resolved by the end of the project prevented data integrity verification from being carried out. Recommendations for future work include (1) achieving functionality and data integrity of the GISAID database, (2) updating the current code to use the latest versions of its libraries, and (3) adopting a different approach to calculating the threshold for a low similarity score. These changes would make the databases easier to maintain and increase the accuracy of the data integrity verification work.
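
The verification step described in the summary (pairwise alignment at the amino acid level, a similarity score per record, and low-similarity records written out) might be sketched roughly as follows. This is only an illustration, not the project's code: the record does not state which alignment tool, scoring scheme, or threshold was used. The sketch assumes a plain Needleman-Wunsch global alignment with simple match/mismatch/gap scores and uses fractional identity as the similarity score. The function names and the 0.35 threshold are hypothetical placeholders.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    The scoring values are illustrative assumptions, not the project's.
    Returns the two aligned strings, with '-' marking gaps.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-')
            i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1])
            j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))


def percent_identity(a, b):
    """Fraction of alignment columns where both sequences have the same residue."""
    aligned_a, aligned_b = global_align(a, b)
    if not aligned_a:
        return 0.0
    matches = sum(x == y for x, y in zip(aligned_a, aligned_b))
    return matches / len(aligned_a)


def flag_low_similarity(records, reference, threshold=0.35):
    """Return (record_id, score) pairs whose identity to the reference is below threshold.

    `records` maps a hypothetical record ID to its protein sequence.
    The default threshold is a placeholder, not a value from the project.
    """
    flagged = []
    for record_id, sequence in records.items():
        score = percent_identity(sequence, reference)
        if score < threshold:
            flagged.append((record_id, score))
    return flagged


if __name__ == "__main__":
    reference = "MKTAYIAK"  # made-up toy sequence for illustration
    records = {"rec1": "MKTAYIAK", "rec2": "WWWWWWWW"}
    for record_id, score in flag_low_similarity(records, reference):
        print(f"{record_id}\t{score:.2f}")
```

In practice a substitution matrix and an established alignment library would likely replace the hand-written scoring here. The fixed threshold is also the part that the summary's third recommendation suggests revisiting.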