Effect of training datasets on support vector machine prediction of protein-protein interactions

Knowledge of protein-protein interaction is useful for elucidating protein function via the concept of 'guilt-by-association'. A statistical learning method, Support Vector Machine (SVM), has recently been explored for the prediction of protein-protein interactions using artificial shuffle...

Bibliographic Details
Main Authors: LO, Siaw Ling; CAI, Cong Zhong; CHUNG, Maxey; CHEN, Yu Zong
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2005
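Subjects: Database of interacting proteins; Protein function prediction; Protein-protein interaction prediction; Shuffled sequence; Support vector machine; SVMlight; Computer Engineering; Data Storage Systems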
Online Access: https://ink.library.smu.edu.sg/sis_research/4874
https://ink.library.smu.edu.sg/context/sis_research/article/5877/viewcontent/Effect___PV.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-5877
record_format dspace
spelling sg-smu-ink.sis_research-5877
2020-02-13T08:48:14Z
2005-03-01T08:00:00Z
text
application/pdf
https://ink.library.smu.edu.sg/sis_research/4874
info:doi/10.1002/pmic.200401118
https://ink.library.smu.edu.sg/context/sis_research/article/5877/viewcontent/Effect___PV.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Database of interacting proteins
Protein function prediction
Protein-protein interaction prediction
Shuffled sequence
Support vector machine
SVMlight
Computer Engineering
Data Storage Systems
description Knowledge of protein-protein interaction is useful for elucidating protein function via the concept of 'guilt-by-association'. A statistical learning method, Support Vector Machine (SVM), has recently been explored for the prediction of protein-protein interactions using artificial shuffled sequences as hypothetical noninteracting proteins, and it has shown promising results (Bock, J. R., Gough, D. A., Bioinformatics 2001, 17, 455-460). It remains unclear, however, how the prediction accuracy is affected if real protein sequences are used to represent noninteracting proteins. In this work, this effect is assessed by comparing the results obtained with real protein sequences against those obtained with shuffled sequences. The real protein sequences of hypothetical noninteracting proteins are generated from an exclusion analysis in combination with subcellular localization information of interacting proteins found in the Database of Interacting Proteins. Prediction accuracy using real protein sequences is 76.9%, compared to 94.1% using artificial shuffled sequences. The discrepancy likely arises because separating two sets of real protein sequences is expected to be harder than separating a set of real protein sequences from a set of artificial sequences. The use of real protein sequences for training an SVM classification system is expected to give better prediction results in practical cases. This is tested by using both SVM systems to predict putative protein partners of a set of thioredoxin-related proteins. The prediction results are consistent with observations, suggesting that real sequences are more useful in practice for developing SVM classification systems that facilitate protein-protein interaction prediction.
format text
author LO, Siaw Ling
CAI, Cong Zhong
CHUNG, Maxey
CHEN, Yu Zong
title Effect of training datasets on support vector machine prediction of protein-protein interactions
publisher Institutional Knowledge at Singapore Management University
publishDate 2005
url https://ink.library.smu.edu.sg/sis_research/4874
https://ink.library.smu.edu.sg/context/sis_research/article/5877/viewcontent/Effect___PV.pdf
_version_ 1770575080848359424
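
The abstract contrasts two ways of representing noninteracting proteins: artificial shuffled sequences and real sequences. As a rough illustration of the first option only, the sketch below builds a shuffled negative pair from a positive pair. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the toy sequences, and the dipeptide comparison are all hypothetical, and the record does not describe the feature encoding or the SVMlight settings that were actually used.

```python
import random
from collections import Counter


def shuffle_sequence(seq, rng):
    """Return a random permutation of the residues in seq (composition is kept)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def dipeptide_counts(seq):
    """Count overlapping adjacent residue pairs, a simple order-sensitive feature."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def make_shuffled_negative(pair, rng):
    """Build one artificial 'noninteracting' pair by shuffling both partners."""
    first, second = pair
    return shuffle_sequence(first, rng), shuffle_sequence(second, rng)


# Hypothetical toy sequences for illustration only (not from the paper or DIP).
positive_pair = ("MKTLLVAGCEWDKRPHSIYNQF", "MSDEKRHGATLPVICWFYNQAG")

rng = random.Random(0)  # fixed seed so the sketch is reproducible
negative_pair = make_shuffled_negative(positive_pair, rng)

for real, fake in zip(positive_pair, negative_pair):
    # Shuffling keeps the amino acid composition identical ...
    same_composition = Counter(real) == Counter(fake)
    # ... but scrambles residue order, so few adjacent pairs usually survive.
    shared = sum((dipeptide_counts(real) & dipeptide_counts(fake)).values())
    print(real, fake, same_composition, shared)
```

A shuffled sequence keeps the residue composition of its source but loses the residue order. That is one plausible reading of why the abstract expects a set of artificial sequences to be easier to separate from real ones than a second set of real sequences would be.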