Processing and analysis of DNA methylation and statistical integration with genetic and clinical data

Epigenetics is genetic regulation that is not directly encoded in the DNA sequence. DNA methylation is an important epigenetic mark and its levels at particular loci have been associated with health and disease states. Better understanding of DNA methylation may lead to development of improved ther...

Bibliographic Details
Main Author: Pan, Hong
Other Authors: Kwoh Chee Keong
Format: Theses and Dissertations
Language: English
Published: 2017
Subjects:
Online Access: http://hdl.handle.net/10356/69957
Institution: Nanyang Technological University
id sg-ntu-dr.10356-69957
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering
spellingShingle DRNTU::Engineering::Computer science and engineering
Pan, Hong
Processing and analysis of DNA methylation and statistical integration with genetic and clinical data
description Epigenetics is genetic regulation that is not directly encoded in the DNA sequence. DNA methylation is an important epigenetic mark, and its levels at particular loci have been associated with health and disease states. Better understanding of DNA methylation may lead to the development of improved therapeutics for these diseases. Recent technical advances such as high-throughput arrays and next-generation sequencing have enabled the discovery of epigenetic correlates of clinical phenotypes across hundreds of clinical samples, genome-wide and at single-base resolution. Epigenome-wide data on large numbers of samples call for efficient analysis of well-designed experiments, new data-analytic pipelines, mathematical models and biological interpretation. This thesis addresses three major challenges. First, reliable analysis requires optimizing the detection of epigenetic variation and minimizing the impact of technical artefacts. Second, algorithms must be very efficient, because the multivariate statistical models that are essential in association studies demand high computational power when applied genome-wide. Third, computational methods should facilitate biological discovery through large-scale data analyses. Corresponding to these challenges, this thesis presents three major contributions. First, a new pipeline has been developed to process genome-wide DNA methylation data generated by the most commonly used array, the Illumina Infinium HumanMethylation 450K, and to remove technical artefacts from it. Compared with pre-existing algorithms, data processed through this pipeline were in better agreement with data obtained from reduced representation bisulfite sequencing of the same clinical samples. This study was further extended to evaluate an emerging next-generation sequencing technology, Methyl Capture Sequencing, in buccal clinical samples. The thesis provides a comprehensive comparison of array and sequencing data across key functional genomic regions, in terms of coverage, concordance of methylation calls and use in epigenome-wide analysis studies. The second part presents a suite of statistical models developed to study the complicated relationships between Gene, Environment and Methylation. Three major functions, GEM_Emodel, GEM_Gmodel and GEM_GxEmodel, have been developed into an R package named GEM. Using matrix-based iterative correlation and memory-efficient data analysis, GEM enables reliable testing of millions of associations between DNA methylation, genetic variants and environmental factors within minutes in a standard computational setting. GEM has been validated by comprehensive benchmarking and is now part of Bioconductor, an extensively used open-source bioinformatics suite. Lastly, GEM was employed to study DNA methylation and its integration with genetic variants and environmental influences in multi-ethnic Asian neonates from a Singapore-based birth cohort (Growing Up in Singapore Towards healthy Outcomes, GUSTO), and to discover methylation changes associated with sub-optimal health outcomes in early life. In an analysis of 237 GUSTO neonatal methylomes, we found that methylation quantitative trait loci were readily detected and that 75% of the most variably methylated regions were best explained by the interaction of genotype with in utero environments. This study shed new light on the complex relationship between biological inheritance and individual prenatal experience, suggesting the importance of considering both genetic variation and environmental factors when interpreting epigenetic variation. The GEM models were also applied to show that HIF3A DNA methylation measured in the umbilical cord of 991 newborns can aid understanding of the genesis of adiposity at birth.
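The idea of testing many methylation associations through matrix operations, as the description mentions, can be illustrated with a minimal sketch. It is written in Python with NumPy rather than R, it is not taken from the GEM package, and it assumes a plain Pearson correlation against a single environmental factor with no covariates. The function names `rowwise_correlation` and `correlation_to_t`, the number of CpG sites and the simulated data are all hypothetical.

```python
import numpy as np


def rowwise_correlation(methylation, exposure):
    """Pearson correlation of every CpG row with one exposure vector."""
    m = np.asarray(methylation, dtype=float)
    e = np.asarray(exposure, dtype=float)
    # Centre each CpG row and the exposure vector.
    m_centered = m - m.mean(axis=1, keepdims=True)
    e_centered = e - e.mean()
    # A single matrix-vector product yields all covariances at once,
    # instead of fitting one regression per CpG in a loop.
    cov = m_centered @ e_centered
    norm = np.sqrt((m_centered ** 2).sum(axis=1) * (e_centered ** 2).sum())
    return cov / norm


def correlation_to_t(r, n_samples):
    """t-statistic of a simple linear regression (n_samples - 2 degrees of freedom)."""
    return r * np.sqrt((n_samples - 2) / (1.0 - r ** 2))


# Simulated data (assumption): rows are CpG sites, columns are samples.
rng = np.random.default_rng(0)
n_cpgs, n_samples = 1000, 237  # 237 echoes the cohort size in the description
exposure = rng.normal(size=n_samples)
methylation = rng.uniform(size=(n_cpgs, n_samples))
methylation[0] += 0.5 * exposure  # plant one true association at row 0

r = rowwise_correlation(methylation, exposure)
t = correlation_to_t(r, n_samples)
print("strongest association at CpG row", int(np.argmax(np.abs(r))))
```

Replacing the per-site loop with one matrix product is what makes this kind of scan cheap. Adjusting for covariates or handling genotype-by-environment interaction terms would need additional steps that this sketch leaves out.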
author2 Kwoh Chee Keong
author_facet Kwoh Chee Keong
Pan, Hong
format Theses and Dissertations
author Pan, Hong
author_sort Pan, Hong
title Processing and analysis of DNA methylation and statistical integration with genetic and clinical data
title_short Processing and analysis of DNA methylation and statistical integration with genetic and clinical data
title_full Processing and analysis of DNA methylation and statistical integration with genetic and clinical data
title_fullStr Processing and analysis of DNA methylation and statistical integration with genetic and clinical data
title_full_unstemmed Processing and analysis of DNA methylation and statistical integration with genetic and clinical data
title_sort processing and analysis of dna methylation and statistical integration with genetic and clinical data
publishDate 2017
url http://hdl.handle.net/10356/69957
_version_ 1759855528122515456
spelling sg-ntu-dr.10356-69957 2023-03-04T00:52:17Z Processing and analysis of DNA methylation and statistical integration with genetic and clinical data Pan, Hong Kwoh Chee Keong School of Computer Science and Engineering Joanna Holbrook DRNTU::Engineering::Computer science and engineering Doctor of Philosophy (SCE) 2017-04-05T07:40:23Z 2017-04-05T07:40:23Z 2017 Thesis Pan, H. (2017). Processing and analysis of DNA methylation and statistical integration with genetic and clinical data. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/69957 10.32657/10356/69957 en 140 p. application/pdf