PROTEIN EXTREMOPHILIC PROPERTIES CLASSIFICATION USING DEEP LEARNING APPROACH

Proteins can be denatured under extreme conditions (high temperature, acidic or alkaline pH, high salinity) because these conditions break the non-covalent interactions that stabilize the native conformation, leading to loss of function. However, there is a group of proteins isolated from extremophilic ba...

Bibliographic Details
Main Author:	Susanty, Meredita
Format:	Dissertations
Language:	Indonesia
Online Access:	https://digilib.itb.ac.id/gdl/view/86561

id	id-itb.:86561
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	Proteins can be denatured under extreme conditions (high temperature, acidic or alkaline pH, high salinity) because these conditions break the non-covalent interactions that stabilize the native conformation, leading to loss of function. However, some proteins isolated from extremophilic bacteria (thermophilic, halophilic, acidophilic, alkaliphilic, and barophilic) can withstand the extreme conditions of their habitats. The stability of these extremophilic proteins has drawn researchers' interest because they can catalyze industrial processes, especially those involving high temperatures, high salt concentrations, or acidic or alkaline conditions. Besides being isolated from extremophilic bacteria, extremophilic proteins can also be obtained through engineering. However, transforming a regular protein into an extremophilic one requires knowing which features of its constituent amino acids need to be altered. Among the various types of extremophilic proteins, only thermophilic proteins have been studied intensively, followed by halophilic proteins; the others remain underexplored because few datasets are available. Research to identify thermophilic and halophilic proteins currently uses in silico approaches as an alternative to time-consuming and expensive experimental methods. The in silico method extracts various features from amino acid sequences, which are then selected manually (handcrafted features) to serve as inputs for machine learning models. The time-consuming extraction process, the need for proteomics expertise, and the subjective nature of the selection make the handcrafted-feature approach scale poorly. 
Although deep learning approaches can achieve good performance in classification and other prediction tasks, they are considered black boxes, which makes their classifications difficult to interpret. Yet besides identifying extremophilic proteins, protein engineering researchers also need information about the unique features that enable proteins to survive in extreme conditions, as a basis for protein design and engineering. This study attempts to overcome the challenges of limited data and manual feature selection by adopting the transfer learning approach from the natural language processing (NLP) domain. Using pre-trained language models (LMs), the data representations they generate in the proteomic domain, and neural network-based classification improved the model's performance in identifying extremophilic proteins despite the limited dataset. On a dataset comprising 2,596 thermophilic, 5,018 halophilic, 1,002 alkaliphilic, and 4,089 acidophilic proteins, the model achieved high accuracy, F1-score, and Matthews correlation coefficient (MCC) values when using embeddings as input. The best models achieved accuracy, F1-score, and MCC values of 0.98, 0.98, and 0.96 for thermophilic; 0.92, 0.94, and 0.80 for halophilic; 0.89, 0.84, and 0.75 for alkaliphilic; and 0.90, 0.93, and 0.75 for acidophilic classification. The results also indicate that the raw embeddings already capture characteristics of thermophilic, halophilic, alkaliphilic, and acidophilic proteins learned during pre-training. Furthermore, a comparison between embedding-based models and fine-tuned models shows that supervised fine-tuning of protein language models (pLMs) enhances model performance. For the multi-class classification task, the fine-tuned ProtT5 model achieved an accuracy of 0.70 and an F1-score of 0.57, compared with an accuracy of 0.67 and an F1-score of 0.53 for the embedding-based model. In addition to identifying extremophilic proteins, this research also aims to interpret the classification model and obtain the features that most influence its decisions. 
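As a reading aid for the three metrics reported above, the following minimal sketch computes accuracy, F1-score, and MCC from the four cells of a binary confusion matrix. The function name `binary_metrics` and the counts are hypothetical illustrations and are not taken from the dissertation; the formulas are the standard definitions.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, F1, MCC) for one binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four cells, so it also penalises poor results
    # on the negative class.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, f1, mcc


# Hypothetical counts, chosen only to illustrate the behaviour.
balanced = binary_metrics(tp=45, fp=5, fn=5, tn=45)
skewed = binary_metrics(tp=90, fp=8, fn=2, tn=0)

print("balanced:", [round(v, 3) for v in balanced])
print("skewed:  ", [round(v, 3) for v in skewed])
```

In the skewed case the classifier never identifies a negative example correctly, yet accuracy and F1 stay high while MCC falls to roughly zero. This is one reason MCC is commonly reported alongside the other two when class sizes differ, as they do in the dataset described here.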
The DeepSHAP method, which is based on Shapley values, is used to interpret the deep learning classification model, whose architecture is relatively complex. The interpretation results are consistent with findings from experimental research. This interpretation helps bridge the gap between the complexity of deep learning models and human understanding, facilitating the development of more reliable and interpretable models in proteomic research.
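To make the idea of a Shapley value concrete, the sketch below computes exact Shapley values by enumerating every feature subset of a tiny scoring function. This is not DeepSHAP, which approximates these values for deep networks; it is the underlying definition applied to a toy case. The function `toy_score` and its three features are hypothetical and have no connection to the dissertation's model.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values; value_fn maps a frozenset of feature indices to a score."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Probability that exactly `size` other features precede i
            # in a uniformly random ordering.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to this subset.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_score(present):
    """Hypothetical model output given which features are present."""
    score = 0.0
    if 0 in present:
        score += 0.4
    if 1 in present:
        score += 0.2
    if 0 in present and 2 in present:
        score += 0.2  # interaction term, shared between features 0 and 2
    return score


phi = shapley_values(toy_score, 3)
print([round(v, 3) for v in phi])
```

The attributions sum to the difference between the score with all features present and the score with none, which is the property that lets them be read as per-feature contributions to a prediction. Exact enumeration grows exponentially with the number of features, which is why approximations are needed for large models.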
format	Dissertations
author	Susanty, Meredita
title	PROTEIN EXTREMOPHILIC PROPERTIES CLASSIFICATION USING DEEP LEARNING APPROACH
url	https://digilib.itb.ac.id/gdl/view/86561
_version_	1823657853048586240
spelling	id-itb.:86561 2024-11-15T09:56:52Z PROTEIN EXTREMOPHILIC PROPERTIES CLASSIFICATION USING DEEP LEARNING APPROACH Susanty, Meredita Indonesia Dissertations extremophilic protein, classification, feature importance, deep learning. INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/86561 text
