Mining HIV-1 information from literature


Bibliographic Details
Main Author: Lim, Clarence Jia Xian
Other Authors: School of Computer Engineering
Format: Final Year Project
Language: English
Published: 2014
Subjects:
Online Access: http://hdl.handle.net/10356/59053
Institution: Nanyang Technological University
Description
Summary: The HIV-1 virus frequently mutates to increase its resistance against certain drugs. These mutations are partly due to histone modifications in the patient’s genome. Information on histone modifications is not easily accessible. Online databases contain a large number of documents about histone modifications, but retrieving them manually is very time-consuming for biologists. This project therefore attempts to automate the retrieval of this information from the databases and integrate it into a single source for ease of access. The program consists of several components that together build the information source. The Document Collection System is the first component; it collects documents and abstracts from the online databases and cleans them for the next stage. TEES is the next component; it takes the cleaned documents and extracts proteins and histone modification events from them. The TEEStoCSV Convertor program takes the output of TEES and converts each file's data into CSV format. The Histone Events Compilation program combines the individual CSV files into one overall CSV file and filters out invalid histones. The Sampling Program takes the overall CSV file and randomly selects 100 samples for the verification process. The Normalization Program takes the overall CSV file and normalizes the terms for the visualization program, Graphviz. The GeneToUniprot program takes the overall CSV file and converts the gene names to Swiss-Prot IDs. Lastly, the XML Constructor program combines the output of the GeneToUniprot program with an extracted histone file to construct the XML file. The overall design architecture uses a pipe-and-filter style to allow extensibility and ease of modification of individual components. The verification results were satisfactory overall, as more than half of the samples were correct. Some of the error types found could also be resolved.
The final output of the program is an XML file, which allows the information to be easily distributed and accessed. The project also recommends improving the event detection of the TEES system to increase the quality of the results.
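
The pipe-and-filter style described in the summary can be illustrated with a minimal Python sketch of the compilation and sampling stages. This is not the project's code: the column names (histone, modification, gene), the rule for what counts as a valid histone, the function names, and the sample rows are all hypothetical, chosen only to show how independent stages can be chained.

```python
import csv
import io
import random
import re

# Hypothetical validity rule (an assumption, not the project's actual filter):
# accept only the core and linker histone names H1, H2A, H2B, H3, H4.
HISTONE_PATTERN = re.compile(r"^H(1|2A|2B|3|4)$")


def read_events(csv_texts):
    """Source stage: yield one dict per row from each per-document CSV."""
    for text in csv_texts:
        yield from csv.DictReader(io.StringIO(text))


def keep_valid_histones(rows):
    """Filter stage: drop rows whose histone field fails the validity rule."""
    for row in rows:
        if HISTONE_PATTERN.match(row["histone"].strip().upper()):
            yield row


def sample_rows(rows, k, seed=0):
    """Sampling stage: draw up to k rows at random for manual verification."""
    rows = list(rows)
    return random.Random(seed).sample(rows, min(k, len(rows)))


# Made-up per-document CSV data, for illustration only.
doc_a = "histone,modification,gene\nH3,methylation,GENE1\nfoo,acetylation,GENE2\n"
doc_b = "histone,modification,gene\nH4,acetylation,GENE3\n"

# Stages are chained like pipes; each one only consumes the previous output.
compiled = list(keep_valid_histones(read_events([doc_a, doc_b])))
verification_sample = sample_rows(compiled, k=100)
```

Because each stage only consumes an iterable of rows and produces another, a stage such as term normalization or gene-name conversion could be inserted or replaced without touching the others, which is the extensibility the pipe-and-filter design aims for.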