PROTEIN EXTREMOPHILIC PROPERTIES CLASSIFICATION USING DEEP LEARNING APPROACH

Main Author: Susanty, Meredita
Format: Dissertations
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/86561
Institution: Institut Teknologi Bandung
id id-itb.:86561
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description Proteins can denature under extreme conditions (high temperature, acidic or alkaline pH, high salinity) because the non-covalent interactions that stabilize their native conformation break, leading to loss of function. However, some proteins isolated from extremophilic bacteria (thermophilic, halophilic, acidophilic, alkaliphilic, and barophilic) can withstand the extreme conditions of their habitats. The stability of these extremophilic proteins has drawn researchers' interest because they can catalyze industrial processes, especially those involving high temperatures, high salt concentrations, acidic or alkaline conditions, and other extremes. Besides being isolated from extremophilic bacteria, extremophilic proteins can also be obtained through engineering. However, transforming regular proteins into extremophilic ones requires knowing which features of the constituent amino acids need to be altered. Among the various types of extremophilic proteins, only thermophilic proteins have been studied intensively, followed by halophilic proteins; the other types remain underexplored because the available datasets are limited. Research to identify thermophilic and halophilic proteins currently relies on in silico approaches as an alternative to time-consuming and expensive experimental methods. The in silico method extracts various features from amino acid sequences, which are then manually selected (handcrafted features) to serve as inputs for machine learning models. The time-consuming extraction process, the need for expertise in proteomics, and the subjective nature of the selection make the handcrafted-features approach hard to scale.
Although deep learning approaches can achieve good performance in classification and other prediction tasks, they are considered black boxes, which makes their classifications difficult to interpret. Yet besides identifying extremophilic proteins, researchers in protein engineering also need information about the unique features that enable proteins to survive in extreme conditions, as a basis for protein design and engineering. This study attempts to overcome the challenges of limited data and manual feature selection by adopting a transfer learning approach from the natural language processing (NLP) domain. Combining pre-trained language models (LMs) from the proteomic domain (protein language models, pLMs), the data representations (embeddings) they generate, and neural network-based classification improved performance in identifying extremophilic proteins despite the limited dataset. On a dataset comprising 2,596 thermophilic, 5,018 halophilic, 1,002 alkaliphilic, and 4,089 acidophilic proteins, the models achieved high accuracy, F1-score, and Matthews correlation coefficient (MCC) values when using embeddings as input. The best models achieved accuracy, F1-score, and MCC of 0.98, 0.98, and 0.96 for thermophilic classification; 0.92, 0.94, and 0.80 for halophilic; 0.89, 0.84, and 0.75 for alkaliphilic; and 0.90, 0.93, and 0.75 for acidophilic. The results also indicate that raw embeddings capture the characteristics of thermophilic, halophilic, alkaliphilic, and acidophilic proteins learned during pre-training. Furthermore, a comparison between embedding-based models and fine-tuned models shows that supervised fine-tuning of pLMs enhances performance. For multi-class classification, the fine-tuned ProtT5 model achieved an accuracy of 0.70 and an F1-score of 0.57, compared with an accuracy of 0.67 and an F1-score of 0.53 for the embedding-based model. In addition to identifying extremophilic proteins, this research also aims to interpret the classification model and identify the features that most influence its decisions.
The DeepSHAP method, which is based on Shapley values, is used to interpret the deep learning classification model, whose architecture is relatively complex. The interpretation results are consistent with experimental research. This interpretation helps bridge the gap between the complexity of deep learning models and human understanding, facilitating the development of more reliable and interpretable models in proteomic research.
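The abstract reports accuracy, F1-score, and MCC for each binary classifier. As a reminder of how these three metrics relate to a confusion matrix, here is a minimal Python sketch. The function name `binary_metrics` and the example counts are hypothetical and chosen for illustration only; they are not taken from the dissertation, and the record does not state how the author computed the metrics.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, F1, MCC) from binary confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives,
    and false negatives. F1 and MCC fall back to 0.0 when their
    denominators are zero (a common convention, assumed here).
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    accuracy = (tp + tn) / total

    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # MCC uses all four cells, so it stays informative on imbalanced classes.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    return accuracy, f1, mcc


# Hypothetical counts for illustration only (not results from the thesis).
acc, f1, mcc = binary_metrics(tp=90, fp=10, tn=80, fn=20)
print(f"accuracy={acc:.2f} F1={f1:.2f} MCC={mcc:.2f}")
```

Because MCC penalizes errors in both classes, it can be noticeably lower than accuracy or F1 on the same predictions, which matches the pattern in the reported figures (for example, 0.92 and 0.94 against an MCC of 0.80 for the halophilic model).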
format Dissertations
author Susanty, Meredita
title PROTEIN EXTREMOPHILIC PROPERTIES CLASSIFICATION USING DEEP LEARNING APPROACH
url https://digilib.itb.ac.id/gdl/view/86561
spelling id-itb.:86561 2024-11-15T09:56:52Z PROTEIN EXTREMOPHILIC PROPERTIES CLASSIFICATION USING DEEP LEARNING APPROACH Susanty, Meredita Indonesia Dissertations extremophilic protein, classification, feature importance, deep learning. INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/86561 text