PROTEIN EXTREMOPHILIC PROPERTIES CLASSIFICATION USING DEEP LEARNING APPROACH
Main Author: | |
---|---|
Format: | Dissertations |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/86561 |
Institution: | Institut Teknologi Bandung |
Summary: | Proteins can be denatured under extreme conditions (high temperature,
acidic or alkaline pH, high salinity) because these conditions break the
non-covalent interactions that stabilize the native conformation, leading to
loss of function. However, a group of proteins isolated from extremophilic
bacteria (thermophilic, halophilic, acidophilic, alkaliphilic, and barophilic)
can withstand the extreme conditions of their habitats. The stability of these
extremophilic proteins has drawn the interest of researchers because they can be
used to catalyze industrial processes, especially those involving high
temperatures, high salt concentrations, acidic or basic conditions, and other
extreme conditions.
Besides being isolated from extremophilic bacteria, extremophilic proteins can also
be obtained through engineering. However, converting ordinary proteins into
extremophilic ones requires knowing which features of the constituent amino
acids need to be altered. Among the various types of extremophilic proteins,
only thermophilic proteins have been studied intensively, followed by halophilic
proteins; the other types remain underexplored because few datasets are
available.
Currently, thermophilic and halophilic proteins are identified using in silico
approaches as an alternative to time-consuming and expensive experimental
methods. The in silico method involves extracting various features from amino
acid sequences. These features are then selected manually (handcrafted features)
to serve as inputs for machine learning models. The time-consuming extraction
process, the need for expertise in proteomics, and the subjective nature of the
selection process limit the scalability of the handcrafted-feature approach.
Deep learning approaches can achieve good performance in classification and
other prediction tasks, but they are considered black boxes, which makes their
classifications difficult to interpret. This matters because, besides
identifying extremophilic proteins, researchers in protein engineering also need
information about the unique features that enable proteins to survive in extreme
conditions as a basis for protein design and engineering.
This study attempts to overcome the challenges of limited data and manual feature
selection by adopting a transfer learning approach from the natural language
processing (NLP) domain. Using pre-trained protein language models (pLMs), the
data representations (embeddings) they generate for protein sequences, and
neural network-based classification improved the model's performance in
identifying extremophilic proteins despite the limited dataset. On a dataset
comprising 2,596 thermophilic, 5,018 halophilic, 1,002 alkaliphilic, and 4,089
acidophilic proteins, the model achieved notably high accuracy, F1-score, and
MCC values when embeddings were used as input. The best models achieved the
following accuracy, F1, and MCC values, respectively: thermophilic 0.98, 0.98,
0.96; halophilic 0.92, 0.94, 0.8; alkaliphilic 0.89, 0.84, 0.75; and acidophilic
0.9, 0.93, 0.75. The results also indicate that the raw embeddings capture
characteristics of thermophilic, halophilic, alkaliphilic, and acidophilic
proteins learned during pre-training. Furthermore, a comparison between
embedding-based models and fine-tuned models shows that supervised fine-tuning
of the pLMs improves model performance. For the multi-class classification task,
the fine-tuned ProtT5 model achieved an accuracy of 0.70 and an F1-score of
0.57, compared with an accuracy of 0.67 and an F1-score of 0.53 for the
embedding-based model.
In addition to identifying extremophilic proteins, this research also aims to
interpret the classification model and identify the features that most influence
its decisions. DeepSHAP, a method based on Shapley values, is used to interpret
the deep learning classifier, which has a relatively complex architecture. The
interpretation results are consistent with findings from experimental studies.
This interpretation helps bridge the gap between the complexity of deep learning
models and human understanding, facilitating the development of more reliable
and interpretable models in proteomic research. |
---|
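
The summary reports accuracy, F1-score, and MCC (Matthews correlation coefficient) side by side, and MCC is consistently the lowest of the three. The following is a minimal sketch of how the three metrics are computed from a binary confusion matrix, using the standard textbook definitions. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the dissertation.

```python
from math import sqrt


def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, F1, MCC) for a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total

    # F1 is the harmonic mean of precision and recall; it ignores true negatives.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    # MCC uses all four cells; by convention it is 0 when any marginal sum is 0.
    mcc_denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    return accuracy, f1, mcc


# Hypothetical, imbalanced counts (95 positives, 30 negatives); illustration only.
acc, f1, mcc = binary_metrics(tp=90, fp=10, fn=5, tn=20)
print(f"accuracy={acc:.2f}  F1={f1:.2f}  MCC={mcc:.2f}")
```

With these invented counts, accuracy is about 0.88 and F1 about 0.92, while MCC is only about 0.66, because one third of the negative class is misclassified. A gap of this kind between F1 and MCC typically appears when the classes are imbalanced, which is one reason to report MCC alongside the other two metrics.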