FICS: Fast DNA/RNA to amino acid alignment using data level parallelism

Bibliographic Details
Main Authors: Lim, Stanley Vincent Wee Ebol, Lim, Steven Edward Cheng, Ting, Carlos Louis Pacifico, Wong, Aaron Eldrich Cue
Format: text
Language: English
Published: Animo Repository 2022
Subjects:
Online Access: https://animorepository.dlsu.edu.ph/etdb_comtech/4
Institution: De La Salle University
Description
Summary: Gene expression analysis is one of the key areas of bioinformatics. It is used to determine the functions of a gene and to discover the effects of external stimuli on an organism. The analysis involves multiple steps: alignment, assembly, quantification, normalization, and modeling. This study focuses only on the first step, sequence alignment, in which reads are mapped to a reference proteome. A frame alignment algorithm is used specifically to map a DNA/RNA sequence to a reference proteome. A non-model organism is one for which no proteome model exists; its reads can be mapped in two ways: de novo mapping or mapping to a close reference proteome. This study focuses on close reference mapping of Scylla serrata (mud crab), using Drosophila melanogaster (fruit fly) as the reference proteome model. This requires mapping millions of reads to the whole reference proteome, hence the need to speed up the alignment phase. Since most frame alignment algorithms are implemented sequentially, this study proposes FICS, a DNA/RNA-to-protein sequence alignment implementation that uses data-level parallelism. It includes the conversion of a sequential frame alignment algorithm to the SIMD paradigm and implementations on three technologies: Intel SIMD ISA (AVX2), CUDA, and FPGA. Analysis shows that the Intel SIMD ISA implementation achieved a speedup of 3.5x, with an average matrix computation time of 2.5 ms. Its memory consumption peaked at 231 MB, and it required around 42-52 watts of power during runtime. The CUDA implementation of the frame alignment algorithm in the SIMT paradigm, on the other hand, resulted in suboptimal speeds, used up to 270 MiB of memory, and drew around 61-63 watts during runtime. The FPGA implementation covered only the two input data preparation steps, with a speedup of about 13,940 times, a maximum memory consumption of 580 KB, and a power consumption of around 2 watts.
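
To make the idea of mapping a DNA/RNA read onto protein sequences more concrete, the following is a minimal Python sketch of translating a read in its three forward reading frames, which is the kind of input a frame alignment algorithm would compare against a reference proteome. It is an illustration only, not the FICS implementation: the function name `translate_frames`, the use of the standard genetic code, and the choice to mark unknown codons with `X` are all assumptions that do not come from this record.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
# (Assumption: the record does not say which translation table FICS uses.)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate_frames(read):
    """Translate a DNA/RNA read in its three forward reading frames.

    Hypothetical helper for illustration. RNA input is handled by
    mapping U to T; codons with unknown bases become 'X'; '*' marks
    a stop codon.
    """
    read = read.upper().replace("U", "T")
    frames = []
    for offset in range(3):
        # Step through complete codons only, starting at this frame's offset.
        codons = (read[i:i + 3] for i in range(offset, len(read) - 2, 3))
        frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames


if __name__ == "__main__":
    print(translate_frames("ATGGCCTAA"))  # ['MA*', 'WP', 'GL']
```

Each frame is translated independently of the others, and each codon lookup is independent of its neighbours, which is the kind of regular, per-element work that a data-level-parallel (SIMD or SIMT) implementation can process in batches.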