Prediction of neutralising antibodies for novel coronavirus with machine learning

Bibliographic Details
Main Author: Kho, Jordon Junyang
Other Authors: Kwoh Chee Keong
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2023
Online Access: https://hdl.handle.net/10356/166683
Institution: Nanyang Technological University
Description
Summary: Coronaviruses have been responsible for three major viral outbreaks since the beginning of the 21st century, the most recent being the coronavirus disease 2019 pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Coronavirus infections are known to cause severe respiratory disease and even death. Unfortunately, there is no effective drug or treatment to prevent or treat the infection. While neutralising antibodies have the potential to prevent future infections, traditional lab-based methods for discovering them are often too time-consuming and expensive. Hence, machine learning approaches have become increasingly popular for expediting and complementing lab-based methods in the search for potential antibody candidates. This project investigated the utility of graph features for the discovery of potential neutralising SARS-CoV-2 antibodies. Tree-based models and other traditional classifiers were trained on mean pooling and max pooling graph features, and their predictive performance was compared with that of baseline Extended Connectivity Fingerprint (ECFP) models. As the data set suffered from class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) and the Synthetic Minority Oversampling Technique-Nominal (SMOTE-N) were applied to oversample minority data points. The best-performing models were mean pooling models trained using SMOTE-N, with accuracies of up to 82% and F1 scores of up to 84% after hyper-parameter tuning. Mean pooling could capture sequence information more accurately than max pooling, and SMOTE-N was found to be more compatible with graph features than SMOTE, as the latter was more susceptible to noise generation. Furthermore, graph features were more interpretable and more compatible with oversampling techniques than molecular fingerprints. However, the models were poor at correctly classifying non-neutralising sequences and had false positive rates as high as 41%. Therefore, exploring other oversampling techniques in combination with undersampling techniques, and experimenting with different pooling approaches to capture atomic information more accurately, could serve as new directions for future work.
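
The difference between the two pooling schemes named in the summary can be illustrated with a minimal sketch. It assumes each antibody is represented as a graph whose nodes carry fixed-length feature vectors; the function names `mean_pool` and `max_pool` and the toy values are hypothetical and are not taken from the project itself.

```python
from typing import List

Vector = List[float]


def mean_pool(node_features: List[Vector]) -> Vector:
    """Average each feature dimension over all nodes of one graph."""
    n = len(node_features)
    return [sum(column) / n for column in zip(*node_features)]


def max_pool(node_features: List[Vector]) -> Vector:
    """Keep the largest value of each feature dimension over all nodes."""
    return [max(column) for column in zip(*node_features)]


# Hypothetical graph: three nodes, two features per node (illustrative values).
toy_graph = [
    [1.0, 0.0],
    [3.0, 0.0],
    [2.0, 6.0],
]

mean_vec = mean_pool(toy_graph)  # [2.0, 2.0]
max_vec = max_pool(toy_graph)    # [3.0, 6.0]
```

Either function turns a variable-sized graph into one fixed-length vector that a tree-based or other traditional classifier can consume. In this toy case, max pooling reports 6.0 for the second feature even though only one node carries it, whereas mean pooling reflects how the feature is distributed across nodes. This may be one reason the two schemes behave differently, though the summary does not state the cause.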