Hodge theory and its applications to molecular data analysis

Bibliographic Details
Main Author: Koh, Ronald Joon Wei
Other Authors: Xia Kelin
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2022
Online Access: https://hdl.handle.net/10356/159024
Institution: Nanyang Technological University
Description
Summary: The Human Genome Project, which sequenced the complete human genome, concluded in 2003. Since then, much headway has been made in understanding how the human genome is organized. The genome is organized at many levels. Its most basic unit is the nucleosome, which consists of DNA wrapped around histone proteins. Higher levels of organization include tetranucleosomes, which consist of several nucleosomes, and topologically associating domains (TADs), which are genomic regions that interact more frequently within themselves than with regions outside the TAD. TADs are a relatively recent discovery, and there are currently no experimentally validated TADs. Furthermore, disruption of TAD boundaries is associated with several diseases, such as cancer. There are currently many methods to call TADs, but none of them is based on a rigorous topological mathematical model. The HodgeRank algorithm, based on the Hodge decomposition theorem, gives us an avenue to quantify these self-interactions. HodgeRank was previously used to rank items from incomplete or imbalanced data, including data from several e-commerce sites and movies from Netflix. We first show that HodgeRank can also be used to quantify the "curvedness" of different biomolecules, with applications such as modelling the protein folding process and comparing biomolecules of different scales and complexities. We then return to Hi-C data, which encompasses TAD regions. We show that, under a suitable metric, HodgeRank can be used to quantify the self-interactions within each TAD region of Chromosome 10, where each region is generated by an existing TAD calling method.

Solar power, a renewable source of energy, plays a significant role in reducing our dependence on fossil fuels. Solar cells harness solar power by converting light energy from the Sun into electrical power through the photovoltaic effect, in which light energy excites an electron to a higher energy state. One class of materials used to make solar cells is hybrid organic-inorganic perovskites (HOIPs). Not only have HOIPs been projected to be one of the most cost-effective options for future solar cells, but their efficiency has also risen from 5% to 25% in the last decade. Current machine learning-based perovskite designs rely heavily on predicting the bandgap of HOIPs. We show that the PerSpect-ML model, which generates machine learning features from the eigenvalues of Hodge Laplacian matrices and was previously applied with great success to protein-ligand binding affinity prediction, can be applied to predict the bandgap for a comprehensive data set of HOIPs. We show that the resulting machine learning model not only significantly reduces the computational cost relative to current models but also has superior overall predictive ability.
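
As a rough illustration of what "eigenvalues of the Hodge Laplacian matrices" means, the sketch below builds the 0th and 1st Hodge Laplacians of a small simplicial complex with NumPy and summarizes their spectra. It is not the PerSpect-ML pipeline from the thesis: the toy complex, the function names (`boundary_1`, `boundary_2`, `spectral_features`), and the choice of summary statistics are all assumptions made for this example.

```python
import numpy as np

# Toy simplicial complex (assumed for illustration only):
# a filled triangle {0, 1, 2} plus vertex 3 joined to vertices 0 and 2,
# which leaves one unfilled loop 0-2-3.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (2, 3)]  # each oriented low -> high
triangles = [(0, 1, 2)]


def boundary_1(vertices, edges):
    """Vertex-by-edge boundary matrix: edge (a, b) maps to b - a."""
    B = np.zeros((len(vertices), len(edges)))
    for j, (a, b) in enumerate(edges):
        B[a, j] = -1.0
        B[b, j] = 1.0
    return B


def boundary_2(edges, triangles):
    """Edge-by-triangle boundary matrix: (a, b, c) maps to (b,c) - (a,c) + (a,b)."""
    index = {e: i for i, e in enumerate(edges)}
    B = np.zeros((len(edges), len(triangles)))
    for j, (a, b, c) in enumerate(triangles):
        B[index[(b, c)], j] = 1.0
        B[index[(a, c)], j] = -1.0
        B[index[(a, b)], j] = 1.0
    return B


def spectral_features(eigs, tol=1e-9):
    """Hypothetical feature summary: zero-eigenvalue count and nonzero statistics."""
    nonzero = eigs[eigs > tol]
    return {
        "zero_count": int(np.sum(np.abs(eigs) <= tol)),
        "min_nonzero": float(nonzero.min()),
        "mean_nonzero": float(nonzero.mean()),
        "max": float(nonzero.max()),
    }


B1 = boundary_1(vertices, edges)
B2 = boundary_2(edges, triangles)

# Hodge Laplacians: L0 is the graph Laplacian; L1 acts on edges.
L0 = B1 @ B1.T
L1 = B1.T @ B1 + B2 @ B2.T

features_0 = spectral_features(np.linalg.eigvalsh(L0))
features_1 = spectral_features(np.linalg.eigvalsh(L1))
print(features_0)
print(features_1)
```

The number of zero eigenvalues of each Laplacian equals the corresponding Betti number, so for this toy complex `zero_count` reports one connected component for `L0` and one loop for `L1`. The nonzero eigenvalues carry additional geometric information, which is the kind of quantity a spectral feature vector can draw on.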