PREDICTION OF DNA-BINDING PROTEINS BASED ON CAPSULE NETWORK METHOD
Main Author: | |
---|---|
Format: | Dissertations |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/81540 |
Institution: | Institut Teknologi Bandung |
Summary: | DNA-binding proteins (DBPs) are a group of proteins that carry out many important biological activities, including DNA replication, DNA damage repair, regulation of transcription, translation, and recombination. Several DBPs have been used as important targets in the development of cancer drugs, antibiotics, and steroids. Identifying DBPs is therefore of great importance to the pharmaceutical field for developing new drugs that regulate the activity of proteins in this group. With advances in DNA sequencing technology, the volume of proteomics data in data banks is increasing rapidly. However, fewer than 1% of the approximately 200 million proteins in the data bank have been annotated, DBPs included. Several experimental methods have been developed to identify DBPs by determining the structure of proteins in complex with DNA, including spectroscopic techniques, X-ray crystallography, and nuclear magnetic resonance (NMR). However, experimental methods are relatively costly and take a long time to annotate the many proteins that fall into the DBP category. This has motivated many researchers to develop automatic computational methods that work from protein sequences. To predict DBPs from sequence, previous studies used multiple sequence alignment (MSA) techniques to extract evolutionary information (EI) that associates certain sequence similarities with DBP function. Most of the methods developed so far feed EI into conventional machine learning models. However, using EI in the form of complex features incurs high computational costs, especially since several previous studies combined multiple EI feature extraction techniques based on more than one variation of the position-specific scoring matrix (PSSM) method. In addition, conventional machine learning methods still involve human intervention and have limited performance on data with large sample sizes.
These weaknesses prevent the prediction process from running optimally.
In recent years, deep learning algorithms have been successfully applied to automatically predict proteins in the DBP family. Deep learning methods that have been used in DBP classification include convolutional neural networks (CNNs) and recurrent neural networks (RNNs). A CNN extracts features at several levels of complexity when classifying proteins: it recognizes simple features in its lower layers and complex features in its deeper layers. An RNN, in turn, captures contextual features from amino acid sequences. One disadvantage of deep learning algorithms is that their performance is less than optimal on datasets with a small number of samples. One way to overcome this weakness is to use a newer deep learning algorithm known as the capsule network (CapsNet). CapsNet was designed to overcome some of the limitations of traditional object recognition with CNNs. The main difference between CapsNet and other architectures lies in its basic unit, the capsule. CapsNet captures feature relationships between capsules using a dynamic routing algorithm. This method has been successfully applied in many proteomics studies.
Additionally, advances in natural language processing (NLP) and the accessibility of supercomputers have enabled pre-trained language models in the field of proteomics that learn sequence patterns, function, and structure. These pre-trained models provide important information about protein sequences in the form of embeddings, which have proven effective in various prediction tasks, such as protein function prediction, contact map prediction, and protein-protein interaction (PPI) prediction.
On the basis of the description above, this research designs two deep learning approaches to predict DBPs, each using a single protein representation technique, eliminating human intervention in the feature selection process, and evaluated on two datasets with different numbers of samples. The first method combines the Bi-LSTM and 1D-CapsNet algorithms with one-hot encoded input and is referred to in this research as BiCaps-DBP. The second method uses the 1D-CapsNet architecture with protein sequence embeddings as input (ProtT5, ESM-1b, and ESM-2) and is referred to as EmbedCaps-DBP. The training data and independent test data for both methods are protein sequences from two datasets with different numbers of samples, namely PDB14189-PDB2272 and PDB1075-PDB186. On the independent test dataset PDB2272, BiCaps-DBP and EmbedCaps-DBP (ProtT5) improved accuracy by 1.05% and 12.65%, respectively, compared to the Target-DBPPred method. On the independent test dataset PDB186, EmbedCaps-DBP (ProtT5) achieved an accuracy of 84.73%, which is 0.33% higher than the HKAM-MKM method on the same dataset. Target-DBPPred and HKAM-MKM are conventional machine learning-based methods that use more than three variations of the PSSM method. These results indicate that the methods proposed in this research outperform the two reference methods. |
---|
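
The summary mentions two building blocks that are easier to follow with a small example: one-hot encoding of a protein sequence and routing between capsules. The following is a minimal sketch in plain Python, not the dissertation's implementation. The function names (`one_hot_encode`, `squash`, `dynamic_routing`), the 20-letter residue alphabet, the all-zero handling of unknown residues, and the choice of three routing iterations are all assumptions made for illustration. The routing loop follows the commonly described routing-by-agreement scheme, which may differ in detail from the 1D-CapsNet used in the research.

```python
import math

# Assumption: the 20 standard amino acids, in an arbitrary fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(sequence):
    """Map each residue to a 20-dim indicator vector (unknown residues -> all zeros)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence.upper():
        row = [0.0] * len(AMINO_ACIDS)
        if residue in index:
            row[index[residue]] = 1.0
        encoded.append(row)
    return encoded


def squash(vector, eps=1e-9):
    """Rescale a capsule vector so its length lies in [0, 1), keeping its direction."""
    sq_norm = sum(v * v for v in vector)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * v for v in vector]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, iterations=3):
    """Routing by agreement.

    u_hat[i][j] is the prediction vector that lower capsule i makes
    for upper capsule j. Returns one output vector per upper capsule.
    """
    n_in, n_out = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    logits = [[0.0] * n_out for _ in range(n_in)]  # routing logits start at zero
    outputs = []
    for _ in range(iterations):
        # Each lower capsule distributes its vote across the upper capsules.
        coupling = [softmax(row) for row in logits]
        outputs = []
        for j in range(n_out):
            weighted = [
                sum(coupling[i][j] * u_hat[i][j][d] for i in range(n_in))
                for d in range(dim)
            ]
            outputs.append(squash(weighted))
        # Raise the logit where a prediction agrees with the upper capsule's output.
        for i in range(n_in):
            for j in range(n_out):
                logits[i][j] += sum(a * b for a, b in zip(u_hat[i][j], outputs[j]))
    return outputs
```

In this sketch, an upper capsule whose incoming predictions agree ends up with a longer output vector than one whose predictions cancel out, which is the sense in which routing captures relationships between capsules.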