PROTEIN EXTREMOPHILIC PROPERTIES CLASSIFICATION USING DEEP LEARNING APPROACH
Main Author: | Susanty, Meredita |
---|---|
Format: | Dissertations |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/86561 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:86561 |
---|---|
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Proteins can be denatured under extreme conditions (high temperature,
acidic/alkaline pH, high salinity) because these conditions break the non-covalent
interactions that stabilize the native conformation, leading to loss of function.
However, a group of proteins isolated from extremophilic bacteria
(thermophilic, halophilic, acidophilic, alkaliphilic, and barophilic) can
withstand the extreme conditions of their habitats. The stability of these
extremophilic proteins has drawn the interest of researchers because they can be
used to catalyze industrial processes, especially those involving high
temperatures, high salt concentrations, acidic/alkaline conditions, and other
extreme conditions.
Besides being isolated from extremophilic bacteria, extremophilic proteins can also
be obtained through engineering. However, transforming regular proteins into
extremophilic ones requires knowing which features of the protein's constituent
amino acids need to be altered. Among the various types of
extremophilic proteins, thermophilic proteins have been studied most intensively,
followed by halophilic proteins, while other extremophilic proteins remain
underexplored because few datasets are available.
Research to identify thermophilic and halophilic proteins is currently conducted
using in silico approaches as an alternative to time-consuming and expensive
experimental methods. The in silico method involves extracting various features
from amino acid sequences. These features are then manually selected (handcrafted
features) to serve as inputs for machine learning models. The time-consuming
extraction process, the need for expertise in proteomics, and the subjective nature
of the selection process limit the scalability of the handcrafted-feature approach.
Although deep learning approaches can achieve good performance in classification
and other prediction tasks, they are considered black boxes, which makes their
classifications difficult to interpret. Yet beyond identifying extremophilic
proteins, researchers in protein engineering also need information about the unique
features that enable proteins to survive in extreme conditions as a basis for protein
design and engineering.
This study attempts to overcome the challenges of limited data and manual feature
selection by adopting a transfer learning approach from the natural language
processing (NLP) domain. Using language models pre-trained in the proteomic domain
(protein language models, pLMs), the data representations (embeddings) they
generate, and neural network-based classification improved the model's performance
in identifying extremophilic proteins, despite the limited dataset. Using a dataset
comprising 2,596 thermophilic, 5,018 halophilic, 1,002 alkaliphilic, and 4,089
acidophilic proteins, the model achieved high accuracy, F1-score, and
MCC values when embeddings were used as input. The best models achieved accuracy,
F1-score, and MCC values, respectively, of 0.98, 0.98, and 0.96 for thermophilic;
0.92, 0.94, and 0.8 for halophilic; 0.89, 0.84, and 0.75 for alkaliphilic; and
0.9, 0.93, and 0.75 for acidophilic classification. The results also indicate that
raw embeddings capture characteristics of thermophilic, halophilic, alkaliphilic,
and acidophilic proteins acquired during pre-training. Furthermore, a comparison
between embedding-based models and fine-tuned models shows that supervised
fine-tuning of pLMs improves model performance. For multi-class classification,
the fine-tuned ProtT5 model achieved an accuracy of 0.70 and an F1-score of 0.57,
compared with an accuracy of 0.67 and an F1-score of 0.53 for the embedding-based
model.
In addition to identifying extremophilic proteins, this research also aims to
interpret the classification model and identify the features that most influence
its decisions. The DeepSHAP method, which is based on Shapley values, is used to
interpret the deep learning classification model, which has a relatively complex
architecture. The interpretation results are consistent with findings from
experimental research. This interpretation helps bridge the gap between the
complexity of deep learning models and human understanding, facilitating the
development of more reliable and interpretable models in proteomic research. |
format |
Dissertations |
author |
Susanty, Meredita |
title |
PROTEIN EXTREMOPHILIC PROPERTIES CLASSIFICATION USING DEEP LEARNING APPROACH |
url |
https://digilib.itb.ac.id/gdl/view/86561 |
_version_ |
1822283449553125376 |
spelling |
id-itb.:86561 2024-11-15T09:56:52Z PROTEIN EXTREMOPHILIC PROPERTIES CLASSIFICATION USING DEEP LEARNING APPROACH Susanty, Meredita Indonesia Dissertations extremophilic protein, classification, feature importance, deep learning. INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/86561 text |
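
The description above reports accuracy, F1-score, and MCC (Matthews correlation coefficient) for each binary classifier. As a reading aid, the sketch below shows how these three metrics are commonly computed from a confusion matrix. It is not the dissertation's evaluation code: the function name `binary_metrics` and the example labels are hypothetical and chosen only for illustration.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return (accuracy, F1, MCC) for binary labels, with 1 as the positive class.

    Hypothetical helper for illustration; not taken from the dissertation.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / (tp + tn + fp + fn)

    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # MCC uses all four counts, so it stays informative on imbalanced classes.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    return accuracy, f1, mcc


# Made-up labels: 1 = extremophilic, 0 = non-extremophilic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
accuracy, f1, mcc = binary_metrics(y_true, y_pred)
print(accuracy, f1, mcc)
```

Because MCC accounts for true negatives as well as true positives, it can be noticeably lower than accuracy or F1 when the classes are imbalanced, which is one common reason to report all three together.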