Algorithm design and code optimization to speed-up bioinformatics software
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | 2012 |
Subjects: | |
Online Access: | http://hdl.handle.net/10356/48453 |
Institution: | Nanyang Technological University |
Summary: | LDhat is a Linux-based package, written in C and developed at Oxford University in 2004, that estimates recombination rates in large-scale population genetic data using Hudson's likelihood method. It consists of several interlinked programs that estimate recombination rates from phased and unphased data, including data with missing information. These estimates support work such as gene targeting, the study of mutations, and predicting the presence of certain disease-causing genes. LDhat is used by many bioinformatics researchers; the National Institutes of Health in the United States is a major user.
At present, several parts of the program can take up to several days to generate results, which makes it resource-intensive. The purpose of this project was to optimise LDhat in order to reduce the time it takes to process input files and generate results. Because the program is used in major bioinformatics studies, it was imperative that the optimisation techniques not affect the results generated.
The basic method used for speed-up in this project was parallel programming with OpenMPI, an implementation of the Message Passing Interface (MPI) standard, applied to the existing code on multi-core processors provided by the Bioinformatics lab. The output was tested against that of the original code to confirm the validity of the results and to compute the speed-up achieved. Several approaches to parallelisation were tried, and the report explains the reasons for the success or failure of each.
The distributed-memory parallel implementation achieved almost linear speed-up in output generation by LDhat. The report compares output graphs and the speed obtained with this approach, and makes recommendations that can be applied similarly to other parts of the program. |
---|
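The summary mentions a distributed-memory decomposition and an almost linear speed-up. As a minimal sketch of those two ideas, and not the report's actual implementation, the C fragment below shows one common way to split independent work items evenly across processes and how speed-up and parallel efficiency are conventionally computed. The names `work_range`, `block_partition`, `speedup`, and `efficiency` are hypothetical, and the code assumes the work items are independent; no MPI calls are made, so it compiles with any C compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of work items assigned to one process.
   Hypothetical type for illustration; not taken from LDhat. */
typedef struct {
    size_t begin;
    size_t end;
} work_range;

/* Split n_items as evenly as possible across n_procs processes.
   The first (n_items % n_procs) ranks each receive one extra item. */
work_range block_partition(size_t n_items, size_t n_procs, size_t rank)
{
    assert(n_procs > 0);
    assert(rank < n_procs);

    size_t base = n_items / n_procs;   /* items every rank gets */
    size_t extra = n_items % n_procs;  /* leftover items to spread */

    work_range r;
    r.begin = rank * base + (rank < extra ? rank : extra);
    r.end = r.begin + base + (rank < extra ? 1 : 0);
    return r;
}

/* Speed-up: serial run time divided by parallel run time. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency: speed-up per process.
   A value of 1.0 corresponds to perfectly linear speed-up. */
double efficiency(double t_serial, double t_parallel, size_t n_procs)
{
    assert(n_procs > 0);
    return speedup(t_serial, t_parallel) / (double)n_procs;
}
```

In an MPI program, each process would typically call something like `block_partition` with its own rank and the communicator size, process only its range, and then combine partial results on one process. An efficiency close to 1.0 is what "almost linear speed-up" means in practice.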