Two clustering problems in analyzing next generation sequencing data

As next generation sequencing (NGS) becomes the dominant technology for studying gene expression profiles, downstream statistical analysis tools are urgently needed. Clustering samples is an important approach to revealing samples’ relationships, such as for the discovery of new subtypes o...

Bibliographic Details
Main Author: Ye, Tian
Other Authors: Lian Heng
Format: Theses and Dissertations
Published: 2016
Subjects: DRNTU::Science::Chemistry::Biochemistry
Online Access: https://hdl.handle.net/10356/65955
Institution: Nanyang Technological University
id sg-ntu-dr.10356-65955
record_format dspace
spelling sg-ntu-dr.10356-65955 2023-03-01T00:01:16Z Two clustering problems in analyzing next generation sequencing data Ye, Tian Lian Heng School of Physical and Mathematical Sciences DRNTU::Science::Chemistry::Biochemistry DOCTOR OF PHILOSOPHY (SPMS) 2016-02-04T09:22:26Z 2016-02-04T09:22:26Z 2016 Thesis Ye, T. (2016). Two clustering problems in analyzing next generation sequencing data. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/65955 10.32657/10356/65955 126 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
topic DRNTU::Science::Chemistry::Biochemistry
spellingShingle DRNTU::Science::Chemistry::Biochemistry
Ye, Tian
Two clustering problems in analyzing next generation sequencing data
description As next generation sequencing (NGS) becomes the dominant technology for studying gene expression profiles, downstream statistical analysis tools are urgently needed. Clustering samples is an important approach to revealing samples’ relationships, such as for the discovery of new subtypes of cancer cells. When clustering high-dimensional data, it is also of interest to select the variables (genes) that are informative for clustering. A new penalized model-based method called PMixClus is presented in this thesis to select genes and perform clustering simultaneously. A negative binomial mixture model is developed for the nonnegative, discrete count data from RNA sequencing experiments. Moreover, our method can automatically determine the number of clusters using the Bayesian information criterion. Additionally, in PMixClus a hybrid hierarchical tree guided by the output of the model-based clustering can be applied to visualize partial clustering structure in a hierarchical way. Results on both simulated and real data demonstrate that our method performs better than or equally well as other competitive methods.
DNA methylation is a significant epigenetic modification that regulates gene transcription and plays a critical role in diseases. Whole genome bisulfite sequencing (WGBS) is a specific NGS technology for detecting genome-wide DNA methylation at single-CpG-site resolution. However, the high cost of such experiments and the complexity of the data challenge the downstream analysis. We proposed a new tool called DMReSearch to identify differentially methylated regions (DMRs) from WGBS data. We developed a three-dimensional rank method to pre-cluster the CpG sites, which considers CpG density, distance between centers, and fluctuation of the differences between two biological groups. We then smoothed the methylation levels in each cluster with a modified local kernel smoother, carried out a statistical test at each CpG site using the beta-binomial distribution, and accordingly trimmed and merged the identified DMRs. We compared our method to BSmooth, which is the most popular method for detecting DMRs from WGBS data. In simulation experiments, DMReSearch presents better receiver operating characteristic curves. Real data experiments show that DMReSearch produces better smoothing results, reports fewer unreasonable DMRs, and is consistent between low- and high-coverage data sets.
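As a rough illustration of two ingredients named in the abstract, the sketch below shows a negative binomial log-probability for a single count and a Bayesian information criterion (BIC) comparison across candidate cluster numbers. It is not the PMixClus implementation: the mean/dispersion parameterization, the function names (`nb_logpmf`, `bic`, `select_num_clusters`), and the example numbers are all assumptions made for illustration, and the thesis's penalty term and parameter counting are not described in this record.

```python
import math


def nb_logpmf(y, mu, phi):
    """Log-probability of count y under a negative binomial distribution.

    Assumed parameterization: mean mu > 0 and dispersion phi > 0,
    so that the variance is mu + phi * mu**2.
    """
    r = 1.0 / phi  # "size" parameter of the negative binomial
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def bic(loglik, n_params, n_obs):
    """Standard BIC: -2 * log-likelihood + (free parameters) * log(sample size)."""
    return -2.0 * loglik + n_params * math.log(n_obs)


def select_num_clusters(fits, n_obs):
    """Return the candidate K with the smallest BIC.

    `fits` maps K to a pair (maximized log-likelihood, number of free parameters).
    """
    return min(fits, key=lambda k: bic(fits[k][0], fits[k][1], n_obs))


# Hypothetical fitted models for K = 1, 2, 3 on 100 samples (made-up numbers).
example_fits = {1: (-1000.0, 10), 2: (-900.0, 20), 3: (-895.0, 30)}
best_k = select_num_clusters(example_fits, n_obs=100)
```

With these made-up values the gain in log-likelihood from K = 2 to K = 3 is too small to offset the extra parameters, so the BIC favors K = 2.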
author2 Lian Heng
author_facet Lian Heng
Ye, Tian
format Theses and Dissertations
author Ye, Tian
author_sort Ye, Tian
title Two clustering problems in analyzing next generation sequencing data
title_short Two clustering problems in analyzing next generation sequencing data
title_full Two clustering problems in analyzing next generation sequencing data
title_fullStr Two clustering problems in analyzing next generation sequencing data
title_full_unstemmed Two clustering problems in analyzing next generation sequencing data
title_sort two clustering problems in analyzing next generation sequencing data
publishDate 2016
url https://hdl.handle.net/10356/65955
_version_ 1759858217842638848