Design and analysis of bioinformatics algorithms on an FPGA platform

The appearance of newer sequencing technologies has largely improved the development of molecular biology and genomic research. A large volume of gene information or protein data can be generated with lower cost, which leads to the exponential growth of existing gene banks or databases. Thus, it bec...

Full description

Saved in:
Bibliographic Details
Main Author: Chen, Yupeng
Other Authors: Bertil Schmidt
Format: Theses and Dissertations
Language:English
Published: 2014
Subjects:
Online Access:https://hdl.handle.net/10356/59979
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Nanyang Technological University
Language: English
Description
Summary:The appearance of newer sequencing technologies has largely improved the development of molecular biology and genomic research. A large volume of gene information or protein data can be generated with lower cost, which leads to the exponential growth of existing gene banks or databases. Thus, it becomes a challenging task for conventional algorithms or tools to extract information with biological significance among these ever increasing databases. There is an urgent need for innovative methods, algorithms, or tools to accomplish these complicated data analysis tasks on a more computationally powerful platform. After decades of development, the FPGA has proved itself in the field of high performance reconfigurable computation. For each generation, one can expect an immediate performance boost with the help of newer manufacturing technologies and a larger volume of resources on a single chip, both of which make it a competitive candidate for application acceleration. This thesis investigates the potential of solving bioinformatics problems utilizing the outstanding computation capability of FPGAs. To give a quantitative analysis on an FPGA’s performance, we choose two bioinformatics problems as the case study: genome search and short read alignment. An efficient architecture is proposed and implemented to maximize the performance of each application on a Virtex-5 FPGA platform. For the genome search problem, we focused on the most time-consuming stage of NCBI BLASTN. Two innovative data structures, a parallel partitioned Bloom filter and a block hash table, are proposed to improve the computation efficiency. The Bloom filter design can support 16 data queries simultaneously with high filtration efficiency; while the latter bypasses the stringent requirement of a perfect/near-perfect hash function. The final system achieves over 10 times speedup against the single-thread NCBI BLASTN program testing with a 100kbase query sequence. In addition, the experiments also show that the match rate together with the degree of parallelism influence the system’s overall performance. For the short read alignment problem, the key is to efficiently locate the segment with a high degree of similarity in the reference genome for each short read. We choose the banded Smith-Waterman algorithm as the align core to narrow down the search space. An innovative architecture is proposed to approximate the conventional algorithm but with a higher degree of parallelism. The align core alone achieves over 30 times speedup. The final design is a hybrid system utilizing both CPU and FPGA. The influence of different partitioning strategies is examined and the performance analysis also indicates that the performance bottleneck changes from the alignment computation to the candidate region search.