Effective graph-based algorithms for weak motif discovery in genomic sequences

This thesis aims to improve weak motif discovery in genomic sequences. The task is of primary significance and urgency because motifs provide the basis for biologists to derive knowledge about gene functions. The knowledge could reveal mechanisms of diseases and lead to novel molecular targets for i...

Bibliographic Details
Main Author: Sun, Hequan
Other Authors: Jagath C. Rajapakse
Format: Theses and Dissertations
Language: English
Published: 2015
Subjects:
Online Access: https://hdl.handle.net/10356/64820
Institution: Nanyang Technological University
id sg-ntu-dr.10356-64820
record_format dspace
spelling sg-ntu-dr.10356-64820 2023-03-04T00:50:08Z Effective graph-based algorithms for weak motif discovery in genomic sequences Sun, Hequan Jagath C. Rajapakse Low Yoke Han, Malcolm Hsu Wen Jing School of Computer Engineering DRNTU::Engineering::Computer science and engineering This thesis aims to improve weak motif discovery in genomic sequences. The task is of primary significance and urgency because motifs provide the basis for biologists to derive knowledge about gene functions. The knowledge could reveal mechanisms of diseases and lead to novel molecular targets for inventing therapeutic drugs. Nevertheless, due to prohibitive cost, traditional wet-lab techniques are no longer adequate for large scale data. In this regard, computational approaches can render valuable help. Computational discovery of weak motifs, however, remains challenging. Because many false instances of a degenerate motif can easily disguise the true ones, in spite of intensive research, performance of the existing algorithms for this problem is far from being satisfactory. Approximate algorithms based on Expectation Maximization or Gibbs Sampling can miss true instances; exact ones based on clique finding in graphs or generating-and-validating patterns (candidate motifs) consume a large amount of time/space. Thus, there is much room for improving the algorithms. We propose three novel algorithms for discovering (weak) motifs from exact datasets, where each sequence contains at least one motif instance. TreeMotif-BF is a tree-structured algorithm, whose novelty lies in the construction of trees of motif instances in a breadth-first manner. Experiments demonstrate that TreeMotif-BF is more scalable than the other existing algorithms, in terms of the length of motifs. However, TreeMotif-BF and many algorithms have difficulty in discovering very weak motifs due to enormous space requirement. Thus, the algorithm TreeMotif-DF constructs trees in a depth-first manner, overcoming the space limitation. Another algorithm RecMotif finds cliques of motif instances in recursively constructed graphs also in a depth-first manner. RecMotif reduces space requirement significantly. Besides, it further improves efficiency in execution time for solving open challenge problems. We also propose two recursive algorithms for discovering motifs from noisy datasets, where some of the input sequences may contain no motif instances. The two generalized algorithms nTreeMotif and nRecMotif are improved from TreeMotif-BF and RecMotif respectively. The algorithms are based on efficient exclusion of noisy sequences and the improved construction of trees/graphs. nTreeMotif and nRecMotif preserve accuracy and efficiency of TreeMotif-BF and RecMotif respectively for dealing with exact datasets. Moreover, they are more scalable in terms of the number of noisy sequences than the existing algorithms. The novel graph-based algorithms have successfully met the research objective. They can effectively discover weak motifs from datasets for which the existing algorithms have difficulty handling. Thus, they should be useful new additions to the repertoire of tools for bioinformatics. DOCTOR OF PHILOSOPHY (SCE) 2015-06-04T07:42:02Z 2015-06-04T07:42:02Z 2014 2014 Thesis Sun, H. (2014). Effective graph-based algorithms for weak motif discovery in genomic sequences. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/64820 10.32657/10356/64820 en 196 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering
description This thesis aims to improve weak motif discovery in genomic sequences. The task is of primary significance and urgency because motifs provide the basis for biologists to derive knowledge about gene functions. The knowledge could reveal mechanisms of diseases and lead to novel molecular targets for inventing therapeutic drugs. Nevertheless, due to prohibitive cost, traditional wet-lab techniques are no longer adequate for large scale data. In this regard, computational approaches can render valuable help. Computational discovery of weak motifs, however, remains challenging. Because many false instances of a degenerate motif can easily disguise the true ones, in spite of intensive research, performance of the existing algorithms for this problem is far from being satisfactory. Approximate algorithms based on Expectation Maximization or Gibbs Sampling can miss true instances; exact ones based on clique finding in graphs or generating-and-validating patterns (candidate motifs) consume a large amount of time/space. Thus, there is much room for improving the algorithms. We propose three novel algorithms for discovering (weak) motifs from exact datasets, where each sequence contains at least one motif instance. TreeMotif-BF is a tree-structured algorithm, whose novelty lies in the construction of trees of motif instances in a breadth-first manner. Experiments demonstrate that TreeMotif-BF is more scalable than the other existing algorithms, in terms of the length of motifs. However, TreeMotif-BF and many algorithms have difficulty in discovering very weak motifs due to enormous space requirement. Thus, the algorithm TreeMotif-DF constructs trees in a depth-first manner, overcoming the space limitation. Another algorithm RecMotif finds cliques of motif instances in recursively constructed graphs also in a depth-first manner. RecMotif reduces space requirement significantly. Besides, it further improves efficiency in execution time for solving open challenge problems. We also propose two recursive algorithms for discovering motifs from noisy datasets, where some of the input sequences may contain no motif instances. The two generalized algorithms nTreeMotif and nRecMotif are improved from TreeMotif-BF and RecMotif respectively. The algorithms are based on efficient exclusion of noisy sequences and the improved construction of trees/graphs. nTreeMotif and nRecMotif preserve accuracy and efficiency of TreeMotif-BF and RecMotif respectively for dealing with exact datasets. Moreover, they are more scalable in terms of the number of noisy sequences than the existing algorithms. The novel graph-based algorithms have successfully met the research objective. They can effectively discover weak motifs from datasets for which the existing algorithms have difficulty handling. Thus, they should be useful new additions to the repertoire of tools for bioinformatics.
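The clique-based, depth-first idea mentioned in the description can be illustrated with a small sketch. This is not the thesis's RecMotif or TreeMotif code. It assumes the common planted (l, d) motif model, which the record does not state: every instance has length l and differs from the unknown motif in at most d positions, so any two true instances differ in at most 2d positions. The function names `hamming`, `lmers`, and `find_cliques` are hypothetical and introduced only for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def lmers(seq, l):
    """All substrings of length l in seq, in order of position."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


def find_cliques(seqs, l, d):
    """Depth-first search for one l-mer per sequence such that every
    pair of chosen l-mers is within Hamming distance 2*d.

    Assumption: planted (l, d) model on an exact dataset (every sequence
    holds an instance). The pairwise 2*d test is only a necessary
    condition, so the result may contain false candidates that a real
    algorithm would still have to validate.
    """
    results = []

    def extend(level, chosen):
        if level == len(seqs):
            results.append(tuple(chosen))
            return
        for cand in lmers(seqs[level], l):
            # Keep the candidate only if it is compatible with all
            # instances already on the current depth-first path.
            if all(hamming(cand, c) <= 2 * d for c in chosen):
                chosen.append(cand)
                extend(level + 1, chosen)
                chosen.pop()

    extend(0, [])
    return results


if __name__ == "__main__":
    # Toy input: the 4-mer ACGT occurs unchanged in every sequence.
    print(find_cliques(["TTACGTTT", "GGACGTGG", "CACGTCCC"], l=4, d=0))
```

Because the search keeps only the current path in memory, its working space grows with the number of sequences rather than with the number of partial cliques, which is the kind of space saving the description attributes to depth-first construction.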
author2 Jagath C. Rajapakse
author_facet Jagath C. Rajapakse
Sun, Hequan
format Theses and Dissertations
author Sun, Hequan
author_sort Sun, Hequan
title Effective graph-based algorithms for weak motif discovery in genomic sequences
publishDate 2015
url https://hdl.handle.net/10356/64820
_version_ 1759857170798608384