Enhanced prediction of several protein structural attributes with machine learning algorithms

The classical sequence-structure-function paradigm for proteins illustrates that the amino acid sequence of a protein determines its three-dimensional (3D) structure and function. With the great success of genome sequencing projects, the gap between the number of sequence-known proteins and the numb...

Bibliographic Details
Main Author:	Yang, Jianyi
Other Authors:	Chen Xin
Format:	Theses and Dissertations
Language:	English
Published:	2012
Subjects:	DRNTU::Science::Mathematics::Applied mathematics::Information theory
Online Access:	https://hdl.handle.net/10356/48058
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-48058
record_format	dspace
spelling	sg-ntu-dr.10356-48058 2023-02-28T23:52:43Z Enhanced prediction of several protein structural attributes with machine learning algorithms Yang, Jianyi Chen Xin School of Physical and Mathematical Sciences DRNTU::Science::Mathematics::Applied mathematics::Information theory DOCTOR OF PHILOSOPHY (SPMS) 2012-03-01T00:54:55Z 2012-03-01T00:54:55Z 2012 2012 Thesis Yang, J. (2012). Enhanced prediction of several protein structural attributes with machine learning algorithms. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/48058 10.32657/10356/48058 en 189 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Science::Mathematics::Applied mathematics::Information theory
spellingShingle	DRNTU::Science::Mathematics::Applied mathematics::Information theory Yang, Jianyi Enhanced prediction of several protein structural attributes with machine learning algorithms
description	The classical sequence-structure-function paradigm for proteins holds that the amino acid sequence of a protein determines its three-dimensional (3D) structure and function. With the great success of genome sequencing projects, the gap between the number of sequence-known proteins and the number of structure-known proteins is widening rapidly. In-silico prediction of protein structure from amino acid sequence has the potential to bridge this gap. This thesis presents machine learning-based computational methods that we developed to predict four protein structural attributes: (1) protein structural class, (2) protein fold, (3) G-protein-coupled receptors (GPCRs), and (4) protein contact map. First, for protein structural class prediction, we use chaos game representation and recurrence quantification analysis to extract a set of features directly from the amino acid sequences. Fisher's discriminant algorithm is adopted as the classifier, and about 65% overall accuracy is achieved for proteins from low-similarity datasets. Comparisons with other methods that use the same kind of input information show that the proposed method achieves higher or comparable accuracy, depending on the dataset tested. When a similar idea is applied to the predicted protein secondary structure, the prediction accuracy can exceed 80%. Second, for taxonomy-based protein fold recognition, a new method named TAXFOLD is proposed, which extracts a comprehensive set of global and local features from the PSI-BLAST and PSIPRED profiles. These features are then fed into a support vector machine to perform fold recognition. Experimental tests on seven datasets demonstrate that TAXFOLD makes an average 6.9% improvement over the best available taxonomic method and performs comparably to the best conventional template-based fold recognition methods. Third, for hierarchical classification of GPCRs, we develop a new method named PCA-GPCR that classifies GPCRs at all five levels of the GPCR classification hierarchy. It relies on a comprehensive set of 1497 sequence-derived features. Because the dimensionality of the feature space is very high, principal component analysis is employed to reduce it to 32. Jackknife tests on a large dataset show that the overall accuracies of PCA-GPCR at the five levels (from the first to the fifth) are 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respectively. Experimental comparisons show that PCA-GPCR consistently outperforms BLAST-based classification and other competing predictors. Finally, for protein contact map prediction, a consensus approach named LRcon is proposed to improve the performance of existing predictors. The approach combines the prediction results from several complementary predictors by using a logistic regression model. Tests on the targets from the recent CASP9 experiment and a large dataset consisting of 856 protein chains show that LRcon outperforms not only its component predictors but also simple averaging and voting schemes.
author2	Chen Xin
author_facet	Chen Xin Yang, Jianyi
format	Theses and Dissertations
author	Yang, Jianyi
author_sort	Yang, Jianyi
title	Enhanced prediction of several protein structural attributes with machine learning algorithms
title_short	Enhanced prediction of several protein structural attributes with machine learning algorithms
title_full	Enhanced prediction of several protein structural attributes with machine learning algorithms
title_fullStr	Enhanced prediction of several protein structural attributes with machine learning algorithms
title_full_unstemmed	Enhanced prediction of several protein structural attributes with machine learning algorithms
title_sort	enhanced prediction of several protein structural attributes with machine learning algorithms
publishDate	2012
url	https://hdl.handle.net/10356/48058
_version_	1759856919077453824
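
The consensus idea summarized in the description (combining the outputs of several contact predictors with a logistic regression model) can be illustrated with a small sketch. This is not the LRcon implementation: the three component predictors, their noise levels, the synthetic data, and every function name below are assumptions made purely for illustration, and the record does not state which predictors, features, or training procedure the thesis uses.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=1.0, epochs=1000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for one residue pair.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def consensus_score(w, b, scores):
    """Combine one residue pair's component scores into a contact probability."""
    return sigmoid(sum(wj * sj for wj, sj in zip(w, scores)) + b)


def make_toy_data(n_pairs=300, seed=0):
    """Synthetic residue pairs scored by three hypothetical predictors.

    Each predictor reports a noisy score in [0, 1]; the noise levels are
    arbitrary assumptions, not properties of any real predictor.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_pairs):
        contact = 1 if rng.random() < 0.3 else 0
        row = [
            min(1.0, max(0.0, 0.2 + 0.6 * contact + rng.gauss(0.0, sd)))
            for sd in (0.25, 0.35, 0.45)
        ]
        X.append(row)
        y.append(contact)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    w, b = fit_logistic(X, y)
    print("learned weights:", [round(wj, 2) for wj in w], "bias:", round(b, 2))
    print("consensus for scores [0.8, 0.6, 0.7]:",
          round(consensus_score(w, b, [0.8, 0.6, 0.7]), 3))
```

Unlike simple averaging, the fitted weights let the model trust some component predictors more than others, which is one plausible reason a learned combination could beat averaging and voting schemes.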

Similar Items