Efficient agglomerative hierarchical clustering for biological sequence analysis

Cluster analysis or clustering is an important data mining technique widely used for pattern recognition and information retrieval. In the context of computational biology and bioinformatics, clustering has been successfully applied in many subfields such as in evolutionary biology to group homologo...

Full description

Saved in:
Bibliographic Details
Main Author: Nguyen, Thuy Diem
Other Authors: Bertil Schmidt
Format: Theses and Dissertations
Language:English
Published: 2015
Subjects:
Online Access:https://hdl.handle.net/10356/65568
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Nanyang Technological University
Language: English
Description
Summary:Cluster analysis or clustering is an important data mining technique widely used for pattern recognition and information retrieval. In the context of computational biology and bioinformatics, clustering has been successfully applied in many subfields such as in evolutionary biology to group homologous genetic sequences into gene families, in transcriptomics to group genes with related expression patterns, or in ecology to describe communities of organisms in heterogeneous environments. Cluster analysis originated from an anthropology study by Driver and Kroeber in 1932. Since then, over a hundred clustering algorithms have been developed to target input datasets with different characteristics. Out of these algorithms, two most prevalent methods are hierarchical clustering and k-means clustering. The former algorithm is particularly useful for analysing genetic datasets in evolutionary biology studies because there are inherent hierarchical relationships amongst the genetic sequences extracted from related organisms in these datasets. The hierarchical clustering analysis of biological sequences is however computational expensive in terms of both execution time and memory usage. Consequently, this analysis was rarely applied to input datasets with more than ten thousand sequences. In recent years, new high-throughput sequencing technologies can produce more data in a shorter time and for a cheaper cost. As a result, more and more raw sequence data is efficiently produced from the genetic materials of live organisms, many of which are examined for the first time. Hence, there is a pressing need for more effective computational techniques to study the relationships among the newly-discovered species. Motivated by the necessity for more effective clustering algorithms to study biological sequence data, I explore the use of parallel computing technologies with new algorithms to perform agglomerative hierarchical sequence clustering in a more effective way without compromising the accuracy of the results. Specifically, I develop new parallel algorithms using general-purpose computing on graphics processing units (or GPGPU) to speed-up the most compute intensive part of the hierarchical clustering process: the pairwise distance matrix computation of the input sequences. Graphic cards from NVIDIA are becoming a commodity in recent years and are available in many personal computers nowadays. With an NVIDIA GPU card, any laptop or desktop running these parallel algorithms can speed up the matrix computation process by ten times to a hundred times faster compared to the computation on a CPU using a traditional sequential (single-threaded) algorithm. Besides reducing execution time, I have built a more memory-efficient and robust agglomerative hierarchical clustering algorithm. This new clustering method reduces memory usage by applying a data summarization technique to maintain a compact version of only a part of the distance matrix instead of loading the whole matrix into the main memory. An important feature of this algorithm is the capability to produce the same hierarchical structure as the standard method. The new algorithm supports all three popular linkage schemes including: average-linkage, single-linkage, and complete-linkage. Among them, average-linkage clustering is widely used in many research and real-world applications, making a memory-efficient average-linkage clustering algorithm in great demand. Amongst various types of biological sequence cluster analyses, this thesis focuses on a particular type of sequence clustering application called operational taxonomic unit (OTU) clustering to demonstrate the usefulness of the afore-mentioned efficient algorithms for processing large genomic datasets. I have developed two OTU clustering pipelines for 454 pyrosequencing datasets called CRiSPy-Embed and CRiSPy-CUDA. A comprehensive evaluation benchmark using randomly simulated datasets and popular mock datasets has been designed to test the performance of these pipelines against existing tools. The benchmark results show that these tools can produce similar or more accurate OTU groupings than most existing OTU hierarchical clustering tools in a much more efficient manner.