Hodge theory and its applications to molecular data analysis
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2022 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/159024 |
Institution: | Nanyang Technological University |
Summary: | The complete DNA sequencing of the entire human genome, better known as the Human Genome Project, concluded in 2003. Since then, much headway has been made in understanding the organization of the human genome. The genome is organized at many levels. Its most basic unit is the nucleosome, which consists of DNA wrapped around histone proteins. Higher levels of organization include tetranucleosomes, which consist of several nucleosomes, and topologically associating domains (TADs), which are regions of the genome that interact more frequently with themselves than with regions outside the TAD. TADs in particular are relatively new to the scene; there are currently no experimentally validated TADs. Furthermore, disruption of TAD boundaries is associated with several diseases, such as cancer.
Nevertheless, there are currently many methods for calling TADs, none of which is based on a rigorous topological model. The HodgeRank algorithm, based on the Hodge decomposition theorem, gives us an avenue to quantify these self-interactions. HodgeRank was previously used to rank items from incomplete or imbalanced data, including data from several e-commerce sites and movies from Netflix. We first show that HodgeRank can also be used to quantify the "curvedness" of different biomolecules, with applications such as modelling the protein folding process and comparing biomolecules of different scales and complexities. We then turn our attention back to Hi-C data, which encompasses TADs and TAD regions. We show that, under a suitable metric, HodgeRank can be used to quantify the self-interactions within each TAD region of Chromosome 10, where each region is generated by an existing TAD-calling method.
Solar power, a renewable source of energy, plays a significant role in reducing our dependence on fossil fuels. Solar cells harness solar power by converting light energy from the Sun into electrical power through the photovoltaic effect, in which light energy excites an electron to a higher energy state. One class of materials used to make these solar cells is hybrid organic-inorganic perovskites (HOIPs). Not only have HOIPs been projected to be one of the most cost-effective options for future solar cells, but their efficiency levels have also risen from 5% to 25% in the last decade.
Current machine learning-based perovskite designs rely heavily on the prediction of the bandgap of HOIPs. The PerSpect-ML model generates machine learning features from the eigenvalues of Hodge Laplacian matrices and was previously applied to protein-ligand binding affinity prediction to great success. We show that it can be applied to the prediction of the bandgap of a comprehensive data set of hybrid organic-inorganic perovskites. We show that the resulting machine learning model not only significantly reduces the computational costs of current models, but is also superior in terms of overall predictive ability. |
---|