Prediction of neutralising antibodies for novel coronavirus with machine learning

Coronaviruses were responsible for three major viral outbreaks since the beginning of the 21st century, with the most recent outbreak being the coronavirus disease 2019 pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Coronavirus infections are known to cause severe r...

Bibliographic Details
Main Author: Kho, Jordon Junyang
Other Authors: Kwoh Chee Keong
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access: https://hdl.handle.net/10356/166683
Institution: Nanyang Technological University
id sg-ntu-dr.10356-166683
record_format dspace
spelling sg-ntu-dr.10356-166683
2023-05-12T15:36:56Z
Prediction of neutralising antibodies for novel coronavirus with machine learning
Kho, Jordon Junyang
Kwoh Chee Keong
School of Computer Science and Engineering
ASCKKWOH@ntu.edu.sg
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Bachelor of Engineering (Computer Science)
2023-05-09T05:21:43Z
2023-05-09T05:21:43Z
2023
Final Year Project (FYP)
Kho, J. J. (2023). Prediction of neutralising antibodies for novel coronavirus with machine learning. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/166683
https://hdl.handle.net/10356/166683
en
SCSE22-0982
application/pdf
Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
description Coronaviruses have been responsible for three major viral outbreaks since the beginning of the 21st century, the most recent being the coronavirus disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Coronavirus infections are known to cause severe respiratory disease and even death. Unfortunately, there is no effective drug or treatment to prevent or treat the infection. While neutralising antibodies have the potential to prevent future infections, traditional lab-based methods for identifying them are often too time-consuming and expensive. Hence, machine learning approaches have become increasingly popular for expediting and complementing lab-based methods in the search for potential antibody candidates. This project investigated the utility of graph features for the discovery of potential neutralising SARS-CoV-2 antibodies. Tree-based models and other traditional classifiers were trained on mean pooling and max pooling graph features, and their predictive performance was compared with that of baseline models trained on Extended Connectivity Fingerprints (ECFPs). As the data set suffered from class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) and the Synthetic Minority Oversampling Technique-Nominal (SMOTE-N) were applied to oversample minority data points. The best performing models were mean pooling models trained using SMOTE-N, with accuracies of up to 82% and F1 scores of up to 84% after hyper-parameter tuning. Mean pooling could capture sequence information more accurately than max pooling, and SMOTE-N was found to be more compatible with graph features than SMOTE because SMOTE was more prone to generating noisy samples. Furthermore, graph features were more interpretable and more compatible with oversampling techniques than molecular fingerprints. However, the models were poor at correctly classifying the non-neutralising sequences and had false positive rates as high as 41%. Future work could therefore explore other oversampling techniques in combination with undersampling techniques, and experiment with different pooling approaches that capture atomic information more accurately.
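The description contrasts mean pooling and max pooling graph features but this record does not say how the project computed them. As a minimal sketch of the general idea only, assuming each graph is given as a list of per-node feature vectors, the two pooling operations reduce a variable number of nodes to one fixed-length vector per graph. The function names and the toy graph below are hypothetical and are not taken from the project.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(node_features: Sequence[Sequence[float]]) -> Vector:
    """Average each feature dimension over all nodes of one graph."""
    if not node_features:
        raise ValueError("graph has no nodes")
    n = len(node_features)
    # zip(*rows) iterates over feature columns instead of node rows.
    return [sum(column) / n for column in zip(*node_features)]


def max_pool(node_features: Sequence[Sequence[float]]) -> Vector:
    """Keep the largest value of each feature dimension over all nodes."""
    if not node_features:
        raise ValueError("graph has no nodes")
    return [max(column) for column in zip(*node_features)]


# Hypothetical toy graph: 3 nodes with 2 features each (not project data).
toy_nodes = [[1.0, 0.0], [3.0, 0.0], [2.0, 6.0]]

mean_vec = mean_pool(toy_nodes)  # [2.0, 2.0]
max_vec = max_pool(toy_nodes)    # [3.0, 6.0]
```

In this toy case the second feature is non-zero on a single node: max pooling reports that one extreme value, while mean pooling spreads it across all nodes. The pooled vector is what a tree-based or other traditional classifier would take as input.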
author2 Kwoh Chee Keong
author_facet Kwoh Chee Keong
Kho, Jordon Junyang
format Final Year Project
author Kho, Jordon Junyang
author_sort Kho, Jordon Junyang
title Prediction of neutralising antibodies for novel coronavirus with machine learning
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/166683
_version_ 1770567061659975680