Prediction of neutralising antibodies for novel coronavirus with machine learning
Coronaviruses have been responsible for three major viral outbreaks since the beginning of the 21st century, with the most recent outbreak being the coronavirus disease 2019 pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Coronavirus infections are known to cause severe r...
Main Author: | Kho, Jordon Junyang |
---|---|
Other Authors: | Kwoh Chee Keong |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
Online Access: | https://hdl.handle.net/10356/166683 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-166683 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-166683; 2023-05-12T15:36:56Z; Prediction of neutralising antibodies for novel coronavirus with machine learning; Kho, Jordon Junyang; Kwoh Chee Keong; School of Computer Science and Engineering; ASCKKWOH@ntu.edu.sg; Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Bachelor of Engineering (Computer Science); 2023-05-09T05:21:43Z; 2023; Final Year Project (FYP); Kho, J. J. (2023). Prediction of neutralising antibodies for novel coronavirus with machine learning. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/166683; en; SCSE22-0982; application/pdf; Nanyang Technological University |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
spellingShingle |
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Kho, Jordon Junyang Prediction of neutralising antibodies for novel coronavirus with machine learning |
description |
Coronaviruses have been responsible for three major viral outbreaks since the beginning of the 21st century, with the most recent outbreak being the coronavirus disease 2019 pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Coronavirus infections are known to cause severe respiratory disease and even death. Unfortunately, there is no effective drug or treatment to prevent and treat the infection. While neutralising antibodies have the potential to prevent future infections, traditional lab-based methods for discovering them are often too time-consuming and expensive. Hence, machine learning approaches have become increasingly popular for expediting and complementing lab-based methods in the search for potential antibody candidates.
This project investigated the utility of graph features for the discovery of potential neutralising SARS-CoV-2 antibodies. Tree-based models and other traditional classifiers were trained on mean pooling and max pooling graph features, and their predictive performance was compared with that of baseline Extended Connectivity Fingerprint (ECFP) models. As the data set suffered from class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) and its nominal variant, SMOTE-Nominal (SMOTE-N), were applied to oversample minority data points. The best-performing models were mean pooling models trained using SMOTE-N, with accuracies of up to 82% and F1 scores of up to 84% after hyper-parameter tuning. Mean pooling could capture sequence information more accurately than max pooling, and SMOTE-N was found to be more compatible with graph features than SMOTE, as the latter was more susceptible to noise generation. Furthermore, graph features were more interpretable and more compatible with oversampling techniques than molecular fingerprints. However, the models were poor at correctly classifying non-neutralising sequences and had false positive rates as high as 41%. Therefore, exploring other oversampling techniques in combination with undersampling techniques, and experimenting with different pooling approaches to capture atomic information more accurately, could serve as new directions for future work. |
author2 |
Kwoh Chee Keong |
author_facet |
Kwoh Chee Keong Kho, Jordon Junyang |
format |
Final Year Project |
author |
Kho, Jordon Junyang |
author_sort |
Kho, Jordon Junyang |
title |
Prediction of neutralising antibodies for novel coronavirus with machine learning |
title_sort |
prediction of neutralising antibodies for novel coronavirus with machine learning |
publisher |
Nanyang Technological University |
publishDate |
2023 |
url |
https://hdl.handle.net/10356/166683 |
_version_ |
1770567061659975680 |
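
The abstract in this record compares mean pooling and max pooling of graph features, but the record itself does not describe how those features were computed. As a minimal sketch, assuming each molecule is represented as a list of per-node feature vectors of equal length, the two pooling operations could look like the following. The function names and the sample values are hypothetical and are not taken from the project.

```python
from typing import List, Sequence


def mean_pool(node_features: Sequence[Sequence[float]]) -> List[float]:
    """Collapse per-node vectors into one vector by averaging each feature."""
    if not node_features:
        raise ValueError("need at least one node")
    n = len(node_features)
    # zip(*rows) iterates over feature columns across all nodes
    return [sum(column) / n for column in zip(*node_features)]


def max_pool(node_features: Sequence[Sequence[float]]) -> List[float]:
    """Collapse per-node vectors into one vector by keeping each feature's maximum."""
    if not node_features:
        raise ValueError("need at least one node")
    return [max(column) for column in zip(*node_features)]


# Hypothetical graph: 3 nodes, 2 features per node (illustrative values only).
nodes = [[1.0, 0.0], [3.0, 4.0], [2.0, 2.0]]

mean_vec = mean_pool(nodes)  # every node contributes to each entry
max_vec = max_pool(nodes)    # only the largest value per feature survives
```

In this sketch the mean-pooled vector reflects every node, while the max-pooled vector keeps only the most extreme value of each feature. That difference is one plausible reading of why the abstract reports mean pooling as capturing sequence information more accurately, though the record does not confirm the mechanism.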