Enhanced prediction of several protein structural attributes with machine learning algorithms

Bibliographic Details
Main Author: Yang, Jianyi
Other Authors: Chen Xin
Format: Theses and Dissertations
Language: English
Published: 2012
Subjects: DRNTU::Science::Mathematics::Applied mathematics::Information theory
Online Access: https://hdl.handle.net/10356/48058
Institution: Nanyang Technological University
id sg-ntu-dr.10356-48058
record_format dspace
spelling sg-ntu-dr.10356-48058 2023-02-28T23:52:43Z; Enhanced prediction of several protein structural attributes with machine learning algorithms; Yang, Jianyi; Chen Xin; School of Physical and Mathematical Sciences; DRNTU::Science::Mathematics::Applied mathematics::Information theory; DOCTOR OF PHILOSOPHY (SPMS); 2012-03-01T00:54:55Z; 2012; Thesis; Yang, J. (2012). Enhanced prediction of several protein structural attributes with machine learning algorithms. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/48058; 10.32657/10356/48058; en; 189 p.; application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Science::Mathematics::Applied mathematics::Information theory
description The classical sequence-structure-function paradigm holds that the amino acid sequence of a protein determines its three-dimensional (3D) structure and function. With the great success of genome sequencing projects, the gap between the number of proteins with known sequences and the number with known structures is widening rapidly. In silico prediction of protein structure from amino acid sequence has the potential to bridge this gap. This thesis presents the machine learning-based computational methods that we developed to predict four protein structural attributes: (1) protein structural class, (2) protein fold, (3) G-protein-coupled receptors (GPCRs), and (4) protein contact map. First, for protein structural class prediction, we propose using the chaos game representation and recurrence quantification analysis to extract a set of features directly from the amino acid sequences. Fisher's discriminant algorithm is adopted as the classifier, and an overall accuracy of about 65% is achieved for proteins from low-similarity datasets. Comparisons with other methods that use the same kind of input information show that the proposed method has higher or comparable accuracy, depending on the dataset tested. When a similar idea is applied to the predicted protein secondary structure, the resulting prediction accuracy can exceed 80%. Second, for taxonomy-based protein fold recognition, a new method named TAXFOLD is proposed, which extracts a comprehensive set of global and local features from PSI-BLAST and PSIPRED profiles. These features are then fed into a support vector machine to perform fold recognition. Experimental tests on seven datasets demonstrate that TAXFOLD achieves an average improvement of 6.9% over the best available taxonomic method and performs comparably to the best conventional template-based fold recognition methods. Third, for hierarchical classification of GPCRs, we develop a new method named PCA-GPCR that can classify GPCRs at all five levels of the GPCR classification hierarchy. It relies on a comprehensive set of 1497 sequence-derived features. Because the dimensionality of this feature space is very high, principal component analysis is employed to reduce it to 32. Jackknife tests on a large dataset show that the overall accuracies of PCA-GPCR at the five levels (from the first to the fifth) are 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respectively. Experimental comparisons show that PCA-GPCR consistently outperforms BLAST-based classification and other competing predictors. Finally, for protein contact map prediction, a consensus approach named LRcon is proposed to improve on the performance of existing predictors. It combines the prediction results from several complementary predictors using a logistic regression model. Tests on targets from the recent CASP9 experiment and on a large dataset of 856 protein chains show that LRcon outperforms not only its component predictors but also simple averaging and voting schemes.
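The consensus idea described for contact map prediction (combining the scores of several predictors with a logistic regression model) can be illustrated with a minimal sketch. This is not the LRcon implementation: the three component predictors, the toy scores, the learning rate, and the function names `fit_logistic` and `consensus_score` are all assumptions made for illustration, and the record does not say which predictors or features the thesis actually uses.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log-loss.

    X: list of rows, one per residue pair; each row holds the scores
       given by the (hypothetical) component predictors.
    y: list of 0/1 labels (1 = the pair is in contact).
    """
    n = len(X)
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            pred = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = pred - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def consensus_score(w, b, scores):
    """Combine component-predictor scores into one contact probability."""
    return sigmoid(sum(wj * sj for wj, sj in zip(w, scores)) + b)


# Toy training data (made up): scores from three hypothetical predictors.
X = [
    [0.9, 0.8, 0.7],
    [0.8, 0.6, 0.9],
    [0.7, 0.9, 0.6],
    [0.2, 0.3, 0.1],
    [0.1, 0.2, 0.4],
    [0.3, 0.1, 0.2],
]
y = [1, 1, 1, 0, 0, 0]

w, b = fit_logistic(X, y)
high = consensus_score(w, b, [0.85, 0.75, 0.80])
low = consensus_score(w, b, [0.15, 0.25, 0.20])
print("weights:", w, "bias:", b)
print("likely contact:", high, "unlikely contact:", low)
```

Unlike simple averaging, the fitted weights let the model trust some component predictors more than others, which is one plausible reason a learned combination can beat averaging and voting.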
author2 Chen Xin
author_facet Chen Xin
Yang, Jianyi
format Theses and Dissertations
author Yang, Jianyi
author_sort Yang, Jianyi
title Enhanced prediction of several protein structural attributes with machine learning algorithms
publishDate 2012
url https://hdl.handle.net/10356/48058
_version_ 1759856919077453824