Semi-supervised multi-label collective classification ensemble for functional genomics

Background: With the rapid accumulation of proteomic and genomic datasets in terms of genome-scale features and interaction networks through high-throughput experimental techniques, the process of manually predicting functional properties of proteins has become increasingly cumbersome, and computa...

Bibliographic Details
Main Authors:	Wu, Qingyao, Ye, Yunming, Ho, Shen-Shyang, Zhou, Shuigeng
Other Authors:	School of Computer Engineering
Format:	Article
Language:	English
Published:	2015
Online Access:	https://hdl.handle.net/10356/102885 http://hdl.handle.net/10220/38675
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-102885
record_format	dspace
spelling	sg-ntu-dr.10356-1028852022-02-16T16:26:50Z Semi-supervised multi-label collective classification ensemble for functional genomics Wu, Qingyao Ye, Yunming Ho, Shen-Shyang Zhou, Shuigeng School of Computer Engineering Background: With the rapid accumulation of proteomic and genomic datasets in terms of genome-scale features and interaction networks through high-throughput experimental techniques, the process of manually predicting functional properties of proteins has become increasingly cumbersome, and computational methods to automate this annotation task are urgently needed. Most approaches to predicting functional properties of proteins require either identifying a reliable set of labeled proteins with attribute features similar to those of unannotated proteins, or learning from a fully labeled protein interaction network with a large amount of labeled data. However, acquiring such labels can be very difficult in practice, especially for multi-label protein function prediction problems. Learning with only a few labeled examples can lead to poor performance, because only limited supervision knowledge can be obtained from similar proteins or from the connections between them. To annotate proteins effectively even when labeled data are scarce, it is important to take advantage of all data sources available in this problem setting, including interaction networks, attribute feature information, correlations of functional labels, and unlabeled data. Results: In this paper, we show that predicting functional properties of proteins from various sources of relational data is fundamentally a typical collective classification (CC) problem in machine learning. The protein function prediction task with limited annotation is then cast into a semi-supervised multi-label collective classification (SMCC) framework. 
As such, we propose a novel generative-model-based SMCC algorithm, called GM-SMCC, to effectively compute the label probability distributions of unannotated protein instances and predict their functional properties. To further boost predictive performance, we extend the method into an ensemble, called EGM-SMCC, that uses multiple heterogeneous networks with various latent linkages, constructed to explicitly model the relationships among the nodes, to effectively propagate supervision knowledge from labeled to unlabeled nodes. Conclusion: Experimental results on a yeast gene dataset for predicting the functions and localization of proteins demonstrate the effectiveness of the proposed method. In the comparison, the proposed algorithms perform better than the other compared algorithms. Published version 2015-09-08T08:09:10Z 2019-12-06T21:01:38Z 2015-09-08T08:09:10Z 2019-12-06T21:01:38Z 2014 2014 Journal Article Wu, Q., Ye, Y., Ho, S.-S., & Zhou, S. (2014). Semi-supervised multi-label collective classification ensemble for functional genomics. BMC Genomics, 15(9), S17-. 1471-2164 https://hdl.handle.net/10356/102885 http://hdl.handle.net/10220/38675 10.1186/1471-2164-15-S9-S17 25521242 en BMC Genomics © 2014 Wu et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
description	Background: With the rapid accumulation of proteomic and genomic datasets in terms of genome-scale features and interaction networks through high-throughput experimental techniques, the process of manually predicting functional properties of proteins has become increasingly cumbersome, and computational methods to automate this annotation task are urgently needed. Most approaches to predicting functional properties of proteins require either identifying a reliable set of labeled proteins with attribute features similar to those of unannotated proteins, or learning from a fully labeled protein interaction network with a large amount of labeled data. However, acquiring such labels can be very difficult in practice, especially for multi-label protein function prediction problems. Learning with only a few labeled examples can lead to poor performance, because only limited supervision knowledge can be obtained from similar proteins or from the connections between them. To annotate proteins effectively even when labeled data are scarce, it is important to take advantage of all data sources available in this problem setting, including interaction networks, attribute feature information, correlations of functional labels, and unlabeled data. Results: In this paper, we show that predicting functional properties of proteins from various sources of relational data is fundamentally a typical collective classification (CC) problem in machine learning. The protein function prediction task with limited annotation is then cast into a semi-supervised multi-label collective classification (SMCC) framework. As such, we propose a novel generative-model-based SMCC algorithm, called GM-SMCC, to effectively compute the label probability distributions of unannotated protein instances and predict their functional properties. 
To further boost predictive performance, we extend the method into an ensemble, called EGM-SMCC, that uses multiple heterogeneous networks with various latent linkages, constructed to explicitly model the relationships among the nodes, to effectively propagate supervision knowledge from labeled to unlabeled nodes. Conclusion: Experimental results on a yeast gene dataset for predicting the functions and localization of proteins demonstrate the effectiveness of the proposed method. In the comparison, the proposed algorithms perform better than the other compared algorithms.
author2	School of Computer Engineering
author_facet	School of Computer Engineering Wu, Qingyao Ye, Yunming Ho, Shen-Shyang Zhou, Shuigeng
format	Article
author	Wu, Qingyao Ye, Yunming Ho, Shen-Shyang Zhou, Shuigeng
spellingShingle	Wu, Qingyao Ye, Yunming Ho, Shen-Shyang Zhou, Shuigeng Semi-supervised multi-label collective classification ensemble for functional genomics
author_sort	Wu, Qingyao
title	Semi-supervised multi-label collective classification ensemble for functional genomics
title_short	Semi-supervised multi-label collective classification ensemble for functional genomics
title_full	Semi-supervised multi-label collective classification ensemble for functional genomics
title_fullStr	Semi-supervised multi-label collective classification ensemble for functional genomics
title_full_unstemmed	Semi-supervised multi-label collective classification ensemble for functional genomics
title_sort	semi-supervised multi-label collective classification ensemble for functional genomics
publishDate	2015
url	https://hdl.handle.net/10356/102885 http://hdl.handle.net/10220/38675
_version_	1725985619472023552

Similar Items