Classification of protein sequences in coronavirus and other human viruses
To date, the COVID-19 pandemic has resulted in over 6 million deaths globally (World Health Organization [WHO], 2023) and infected more than 2 million people in Singapore (Ministry of Health Singapore [MOH], 2023). The profound impacts of the pandemic cannot be overstated; beginn...
Main Author: | Lee, Ryan Kai Jun |
---|---|
Other Authors: | Fedor Duzhin |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University 2023 |
Subjects: | Science::Mathematics |
Online Access: | https://hdl.handle.net/10356/166448 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-166448 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-166448 2023-05-01T15:35:50Z Classification of protein sequences in coronavirus and other human viruses Lee, Ryan Kai Jun Fedor Duzhin School of Physical and Mathematical Sciences FDuzhin@ntu.edu.sg Science::Mathematics Bachelor of Science in Mathematical Sciences and Economics 2023-04-26T08:22:12Z 2023-04-26T08:22:12Z 2023 Final Year Project (FYP) Lee, R. K. J. (2023). Classification of protein sequences in coronavirus and other human viruses. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/166448 https://hdl.handle.net/10356/166448 en application/pdf Nanyang Technological University |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Science::Mathematics |
spellingShingle |
Science::Mathematics Lee, Ryan Kai Jun Classification of protein sequences in coronavirus and other human viruses |
description |
To date, the COVID-19 pandemic has resulted in over 6 million deaths globally (World
Health Organization [WHO], 2023) and infected more than 2 million people in
Singapore (Ministry of Health Singapore [MOH], 2023). The profound impacts of the
pandemic cannot be overstated, ranging from the costs to individual lives and livelihoods to
the slowdown of economic activity in any country unprepared to mitigate its spread
(Gong et al. 2022). Despite the gradual lifting of restrictions related to air travel
and safe distancing, alongside the recurrent release of vaccines targeting the newest
COVID-19 variants, research on applying machine learning and artificial
intelligence to understanding viral protein structures continues to be a work
in progress. As Baker and Sali (2001) point out, a complete understanding of the
biological role of proteins requires the study of their structures and individual functions.
Proteins are complex molecules built from 20 types of amino acids joined together by
peptide bonds (Alberts, 2002). The SARS-CoV-2 virus has 4 important structural proteins:
envelope (E), membrane (M), spike (S), and nucleocapsid (N) (Gordon et al. 2020). This study
focuses on SARS-CoV-2 E proteins because these small, integral membrane proteins
are involved in key stages of the virus’s life cycle, including its assembly, budding,
envelope formation, and pathogenesis (Schoeman and Fielding 2019). At the same time, I
compare these E proteins against proteins from other related human viruses to determine
similarities in their pathogenesis. The genomic data encoded in these single-stranded RNA
viruses plays a critical role in virulence, and it is widely believed that mutations in their
protein sequences are key determinants of resistance to antiviral drugs (Bai, Zhong, and
Gao 2021).
An important open problem in biophysics is to understand why certain proteins fold to form
homo-pentameric ion channels. These homo-pentameric cation channels are crucial to the
virus’s pathogenicity (Verdiá-Báguena et al. 2012). Understanding how protein sequences
relate to protein structures also remains relevant to the design of antiviral drugs
(Tomar and Arkin 2020). In this project, I suggest a
statistical approach to this problem based on machine learning. Specifically, I train a number
of human-interpretable machine learning models to predict whether a protein is able to form
a homo-pentameric channel and to identify features that are important for this formation
to take place. The process by which a protein adopts its structure is known as protein
folding. Understanding these features can potentially provide insights into the ‘protein
folding problem’, which comprises three sub-problems: what the folding code is, what the
folding mechanism is, and whether we can predict the native structure of a protein from its
amino acid sequence (Dill et al. 2008). |
author2 |
Fedor Duzhin |
author_facet |
Fedor Duzhin Lee, Ryan Kai Jun |
format |
Final Year Project |
author |
Lee, Ryan Kai Jun |
author_sort |
Lee, Ryan Kai Jun |
title |
Classification of protein sequences in coronavirus and other human viruses |
title_short |
Classification of protein sequences in coronavirus and other human viruses |
title_full |
Classification of protein sequences in coronavirus and other human viruses |
title_fullStr |
Classification of protein sequences in coronavirus and other human viruses |
title_full_unstemmed |
Classification of protein sequences in coronavirus and other human viruses |
title_sort |
classification of protein sequences in coronavirus and other human viruses |
publisher |
Nanyang Technological University |
publishDate |
2023 |
url |
https://hdl.handle.net/10356/166448 |
_version_ |
1765213822242521088 |
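
The description above mentions training human-interpretable models to predict whether a protein forms a homo-pentameric channel and to identify the features that matter. The record does not say which models or features the project used, so the following is only a minimal sketch of the general idea, assuming amino acid composition as features and a one-split decision tree (a "stump") as the interpretable model. The function names, the hydrophobic residue grouping, and the toy sequences and labels are all hypothetical and invented for illustration; they are not real viral proteins or results.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
HYDROPHOBIC = set("AVILMFWC")  # a simplified grouping, assumed for this sketch


def featurize(seq):
    """Map a non-empty protein sequence to a dict of simple, readable features."""
    counts = Counter(seq)
    n = len(seq)
    feats = {f"frac_{aa}": counts.get(aa, 0) / n for aa in AMINO_ACIDS}
    feats["frac_hydrophobic"] = sum(counts.get(aa, 0) for aa in HYDROPHOBIC) / n
    feats["length"] = float(n)
    return feats


def fit_stump(rows, labels):
    """Fit a one-split decision tree by exhaustive search over features.

    Returns (feature_name, threshold, label_if_above, training_accuracy).
    """
    best = None
    for name in rows[0]:
        values = sorted({r[name] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            for above in (0, 1):
                preds = [above if r[name] > thr else 1 - above for r in rows]
                acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
                if best is None or acc > best[3]:
                    best = (name, thr, above, acc)
    return best


def predict(stump, seq):
    """Apply the fitted stump to a new sequence and return 0 or 1."""
    name, thr, above, _ = stump
    return above if featurize(seq)[name] > thr else 1 - above


# Hypothetical toy data, made up for illustration only.
toy_sequences = ["LLVVAALLFFIL", "AVLLIVFFALLV", "DDEEKKRRSSTT", "KRDESTNQKRDE"]
toy_labels = [1, 1, 0, 0]  # 1 = "forms a channel" in this invented example

stump = fit_stump([featurize(s) for s in toy_sequences], toy_labels)
name, thr, above, acc = stump
print(f"rule: predict {above} if {name} > {thr:.3f} (training accuracy {acc:.2f})")
```

The fitted rule is a single readable threshold on one feature, which is the sense in which such a model is interpretable. A real study would need curated, labeled sequences and held-out evaluation, and would likely use richer features and models than this sketch.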