An improved classification of G-protein-coupled receptors using sequence-derived features
Main Authors: | , , |
---|---|
Other Authors: | |
Format: | Article |
Language: | English |
Published: | 2013 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/100530 http://hdl.handle.net/10220/17875 |
Institution: | Nanyang Technological University |
Summary: | Background: G-protein-coupled receptors (GPCRs) play a key role in diverse physiological processes and are the
targets of almost two-thirds of marketed drugs. The 3D structures of GPCRs are largely unavailable; however,
a large number of GPCR primary sequences are known. To facilitate the identification and characterization of novel
receptors, a computational method that accurately predicts GPCRs from protein primary sequences is therefore
very valuable.
Results: We propose a new method, called PCA-GPCR, to predict GPCRs using a comprehensive set of 1497
sequence-derived features. Principal component analysis is first employed to reduce the dimension of the
feature space to 32. The resulting 32-dimensional feature vectors are then fed into a simple yet powerful
classification algorithm, called intimate sorting, to predict GPCRs at five levels. The prediction at the first level
determines whether a protein is a GPCR or a non-GPCR. If it is predicted to be a GPCR, it is further
assigned to a family, subfamily, sub-subfamily, and subtype by the classifiers at the second, third, fourth, and
fifth levels, respectively. To train the classifiers applied at the five levels, a non-redundant dataset is carefully
constructed, which contains 3178, 1589, 4772, 4924, and 2741 protein sequences at the respective levels. Jackknife
tests on this training dataset show that the overall accuracies of PCA-GPCR at the five levels (from the first to the fifth)
reach up to 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respectively. We further perform predictions on a
dataset of 1238 GPCRs at the second level, and on two further datasets of 167 and 566 GPCRs, respectively, at the
fourth level. The overall prediction accuracies of our method are consistently higher than those of the existing
methods compared.
Conclusions: The comprehensive set of 1497 features is believed to capture information about
amino acid composition, sequence order, and various physicochemical properties of proteins. As a result, high
accuracies are achieved when predicting GPCRs at all five levels with our proposed method. |
---|
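The pipeline outlined in the summary (dimension reduction by principal component analysis, a distance-based classifier, and a jackknife test) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: a plain Euclidean nearest-neighbor rule stands in for the intimate sorting algorithm, whose exact scoring is not given in this record; the data are synthetic random vectors rather than the 1497 sequence-derived features; and the projection is fitted once on all samples for brevity, which may differ from the authors' protocol. The function names (`pca_fit`, `jackknife_accuracy`, and so on) are hypothetical.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the feature mean and the top principal axes of X."""
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal axes,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Project X onto the principal axes."""
    return (X - mean) @ components.T


def nearest_neighbor_predict(train_Z, train_y, query):
    """Stand-in classifier (assumption): label of the closest training vector."""
    distances = np.linalg.norm(train_Z - query, axis=1)
    return train_y[np.argmin(distances)]


def jackknife_accuracy(Z, y):
    """Leave-one-out test: hold out each sample once, predict it from the rest."""
    n = len(y)
    correct = 0
    for i in range(n):
        keep = np.arange(n) != i
        predicted = nearest_neighbor_predict(Z[keep], y[keep], Z[i])
        correct += int(predicted == y[i])
    return correct / n


def make_toy_data(seed=0, n_per_class=30, n_features=200):
    """Synthetic two-class data (NOT the dataset described in the record)."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(0.0, 3.0, size=(2, n_features))
    X = np.vstack(
        [c + rng.normal(0.0, 1.0, size=(n_per_class, n_features)) for c in centers]
    )
    y = np.repeat([0, 1], n_per_class)
    return X, y


X, y = make_toy_data()
mean, components = pca_fit(X, n_components=32)  # 32 dimensions, as in the summary
Z = pca_transform(X, mean, components)
accuracy = jackknife_accuracy(Z, y)
print(f"reduced shape: {Z.shape}, jackknife accuracy: {accuracy:.3f}")
```

A five-level scheme like the one in the summary could reuse the same two steps at each level, passing a sequence to the next classifier only when the previous level predicts a GPCR; that cascade is omitted here to keep the sketch short.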