Two clustering problems in analyzing next generation sequencing data

As next-generation sequencing (NGS) becomes the dominant technology for studying gene expression profiles, downstream statistical analysis tools are urgently needed. Clustering samples is an important approach to revealing samples’ relationships, such as for the discovery of new subtypes o...

Bibliographic Details
Main Author:	Ye, Tian
Other Authors:	Lian Heng
Format:	Theses and Dissertations
Published:	2016
Subjects:	DRNTU::Science::Chemistry::Biochemistry
Online Access:	https://hdl.handle.net/10356/65955
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-65955
record_format	dspace
spelling	sg-ntu-dr.10356-65955 2023-03-01T00:01:16Z Two clustering problems in analyzing next generation sequencing data Ye, Tian Lian Heng School of Physical and Mathematical Sciences DRNTU::Science::Chemistry::Biochemistry As next-generation sequencing (NGS) becomes the dominant technology for studying gene expression profiles, downstream statistical analysis tools are urgently needed. Clustering samples is an important approach to revealing samples’ relationships, such as for the discovery of new subtypes of cancer cells. When clustering high-dimensional data, it is also of interest to select the variables (genes) that are informative for clustering. A new penalized model-based method called PMixClus is presented in this thesis to select genes and perform clustering simultaneously. A negative binomial mixture model is developed for the nonnegative, discrete count data from RNA sequencing experiments. Moreover, our method can automatically determine the number of clusters using the Bayesian information criterion. Additionally, in PMixClus a hybrid hierarchical tree guided by the output of model-based clustering can be applied to visualize partial clustering structure in a hierarchical way. Results on both simulated and real data demonstrate that our method performs better than or as well as other competitive methods. DNA methylation is a significant epigenetic modification that regulates gene transcription and plays a critical role in diseases. Whole-genome bisulfite sequencing (WGBS) is an NGS technology for detecting genome-wide DNA methylation at single-CpG-site resolution. However, the high cost of such experiments and the complexity of the data challenge downstream analysis. We proposed a new tool called DMReSearch to identify differentially methylated regions (DMRs) from WGBS data. We developed a three-dimensional rank method to pre-cluster the CpG sites, which considers CpG density, distance between centers, and fluctuation of differences between two biological groups. We then smoothed the methylation levels in each cluster with a modified local kernel smoother, carried out a statistical test at each CpG site using the beta-binomial distribution, and trimmed and merged the identified DMRs accordingly. We compared our method to BSmooth, which is the most popular method for detecting DMRs from WGBS data. In simulation experiments, DMReSearch yields better receiver operating characteristic curves. Real-data experiments show that DMReSearch produces better smoothing results, reports fewer unreasonable DMRs, and is consistent between low- and high-coverage data sets. DOCTOR OF PHILOSOPHY (SPMS) 2016-02-04T09:22:26Z 2016-02-04T09:22:26Z 2016 Thesis Ye, T. (2016). Two clustering problems in analyzing next generation sequencing data. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/65955 10.32657/10356/65955 126 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
topic	DRNTU::Science::Chemistry::Biochemistry
spellingShingle	DRNTU::Science::Chemistry::Biochemistry Ye, Tian Two clustering problems in analyzing next generation sequencing data
description	As next-generation sequencing (NGS) becomes the dominant technology for studying gene expression profiles, downstream statistical analysis tools are urgently needed. Clustering samples is an important approach to revealing samples’ relationships, such as for the discovery of new subtypes of cancer cells. When clustering high-dimensional data, it is also of interest to select the variables (genes) that are informative for clustering. A new penalized model-based method called PMixClus is presented in this thesis to select genes and perform clustering simultaneously. A negative binomial mixture model is developed for the nonnegative, discrete count data from RNA sequencing experiments. Moreover, our method can automatically determine the number of clusters using the Bayesian information criterion. Additionally, in PMixClus a hybrid hierarchical tree guided by the output of model-based clustering can be applied to visualize partial clustering structure in a hierarchical way. Results on both simulated and real data demonstrate that our method performs better than or as well as other competitive methods. DNA methylation is a significant epigenetic modification that regulates gene transcription and plays a critical role in diseases. Whole-genome bisulfite sequencing (WGBS) is an NGS technology for detecting genome-wide DNA methylation at single-CpG-site resolution. However, the high cost of such experiments and the complexity of the data challenge downstream analysis. We proposed a new tool called DMReSearch to identify differentially methylated regions (DMRs) from WGBS data. We developed a three-dimensional rank method to pre-cluster the CpG sites, which considers CpG density, distance between centers, and fluctuation of differences between two biological groups. We then smoothed the methylation levels in each cluster with a modified local kernel smoother, carried out a statistical test at each CpG site using the beta-binomial distribution, and trimmed and merged the identified DMRs accordingly. We compared our method to BSmooth, which is the most popular method for detecting DMRs from WGBS data. In simulation experiments, DMReSearch yields better receiver operating characteristic curves. Real-data experiments show that DMReSearch produces better smoothing results, reports fewer unreasonable DMRs, and is consistent between low- and high-coverage data sets.
author2	Lian Heng
author_facet	Lian Heng Ye, Tian
format	Theses and Dissertations
author	Ye, Tian
author_sort	Ye, Tian
title	Two clustering problems in analyzing next generation sequencing data
title_sort	two clustering problems in analyzing next generation sequencing data
publishDate	2016
url	https://hdl.handle.net/10356/65955
_version_	1759858217842638848
