Positive-unlabeled learning for disease gene identification

Background: Identifying disease genes from the human genome is an important but challenging task in biomedical research. Machine learning methods can be applied to discover new disease genes based on the known ones. Existing machine learning methods typically use the known disease genes as the positive...

Bibliographic Details
Main Authors: Yang, Peng; Li, Xiaoli; Mei, Jian-Ping; Kwoh, Chee Keong; Ng, See-Kiong
Other Authors: School of Computer Engineering
Format: Article
Language: English
Published: 2013
Subjects: DRNTU::Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/96132
http://hdl.handle.net/10220/10776
Institution: Nanyang Technological University
id sg-ntu-dr.10356-96132
record_format dspace
spelling sg-ntu-dr.10356-96132 2022-02-16T16:27:51Z
Positive-unlabeled learning for disease gene identification
Yang, Peng; Li, Xiaoli; Mei, Jian-Ping; Kwoh, Chee Keong; Ng, See-Kiong
School of Computer Engineering; Bioinformatics Research Centre
DRNTU::Engineering::Computer science and engineering
2013-06-27T03:18:55Z 2019-12-06T19:26:11Z 2012
Journal Article
Yang, P., Li, X. L., Mei, J.-P., Kwoh, C.-K., & Ng, S.-K. (2012). Positive-unlabeled learning for disease gene identification. Bioinformatics, 28(20), 2640-2647.
ISSN 1367-4803
https://hdl.handle.net/10356/96132
http://hdl.handle.net/10220/10776
DOI 10.1093/bioinformatics/bts504
PMID 22923290
en
Bioinformatics
© 2012 The Author.
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering
description Background: Identifying disease genes from the human genome is an important but challenging task in biomedical research. Machine learning methods can be applied to discover new disease genes based on the known ones. Existing machine learning methods typically use the known disease genes as the positive training set P and the unknown genes as the negative training set N (a confirmed non-disease gene set does not exist) to build classifiers that identify new disease genes among the unknown genes. However, such classifiers are actually built from a noisy negative set N, because N itself can contain disease genes that have not yet been discovered. As a result, the classifiers do not perform as well as they could. Results: Instead of treating the unknown genes as negative examples in N, we treat them as an unlabeled set U. We design a novel positive-unlabeled (PU) learning algorithm, PUDI (PU learning for disease gene identification), to build a classifier using P and U. We first partition U into four sets, namely the reliable negative set RN, the likely positive set LP, the likely negative set LN and the weak negative set WN. Weighted support vector machines are then used to build a multi-level classifier from these four sets and the positive training set P to identify disease genes. Our experimental results demonstrate that the proposed PUDI algorithm significantly outperformed the existing methods. Conclusion: The proposed PUDI algorithm identifies disease genes more accurately by treating the unknown data more appropriately, as an unlabeled set U instead of a negative set N. Given that many machine learning problems in biomedical research involve positive and unlabeled data rather than negative data, the machine learning methods for these problems may be further improved by adopting PU learning methods, as we have done here for disease gene identification.
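The two stages named in the abstract (partitioning U, then weighted training) can be illustrated with a minimal Python sketch. The abstract does not say how U is partitioned or which weights are used, so everything specific below is an assumption made for illustration: the distance-to-centroid scoring, the quartile cut-offs, the ordering LP < WN < LN < RN by distance, the weight values, and all function names. A single weighted linear SVM trained by subgradient descent stands in for the multi-level classifier, so this is not the PUDI implementation.

```python
# Illustrative sketch only. Scoring rule, cut-offs, set ordering and weights
# are assumptions, not the procedure or values used by PUDI.
from math import dist

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def partition_unlabeled(P, U, cuts=(0.25, 0.5, 0.75)):
    """Split indices of U into LP, WN, LN, RN by distance rank from P's centroid."""
    c = centroid(P)
    ranked = sorted(range(len(U)), key=lambda i: dist(U[i], c))
    b1, b2, b3 = (int(len(ranked) * q) for q in cuts)
    return {
        "LP": ranked[:b1],    # closest to the known positives
        "WN": ranked[b1:b2],
        "LN": ranked[b2:b3],
        "RN": ranked[b3:],    # farthest from the known positives
    }

# Hypothetical labels and per-set weights: confident sets weigh more.
LABELS = {"LP": 1, "WN": -1, "LN": -1, "RN": -1}
WEIGHTS = {"P": 1.0, "LP": 0.3, "WN": 0.2, "LN": 0.5, "RN": 1.0}

def build_training_set(P, U, parts):
    """Combine P with the four subsets of U into (X, y, sample weights)."""
    X, y, w = list(P), [1] * len(P), [WEIGHTS["P"]] * len(P)
    for name, indices in parts.items():
        for i in indices:
            X.append(U[i])
            y.append(LABELS[name])
            w.append(WEIGHTS[name])
    return X, y, w

def score(theta, b, x):
    """Signed decision value of the linear classifier."""
    return sum(t * v for t, v in zip(theta, x)) + b

def train_weighted_svm(X, y, w, epochs=200, lr=0.1, lam=0.01):
    """Full-batch subgradient descent on an L2-regularised weighted hinge loss."""
    theta, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad, grad_b = [lam * t for t in theta], 0.0
        for xi, yi, wi in zip(X, y, w):
            if yi * score(theta, b, xi) < 1:  # margin violated
                for j, v in enumerate(xi):
                    grad[j] -= wi * yi * v / len(X)
                grad_b -= wi * yi / len(X)
        theta = [t - lr * g for t, g in zip(theta, grad)]
        b -= lr * grad_b
    return theta, b

# Toy two-feature data (made up): three known positives, eight unlabeled genes.
P = [[2.0, 2.0], [2.2, 1.8], [1.8, 2.1]]
U = [[1.9, 2.0], [2.1, 2.2], [1.0, 1.2], [0.8, 1.0],
     [0.0, 0.2], [-0.5, 0.0], [-1.5, -1.0], [-2.0, -2.0]]
parts = partition_unlabeled(P, U)
X, y, w = build_training_set(P, U, parts)
theta, b = train_weighted_svm(X, y, w)
```

The point of the per-set weights is that a mistake on an uncertain example (LP, WN) costs less than a mistake on a confident one (P, RN), which is how a weighted SVM can use U without treating every unlabeled gene as a clean negative.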
author2 School of Computer Engineering
format Article
author Yang, Peng
Li, Xiaoli
Mei, Jian-Ping
Kwoh, Chee Keong
Ng, See-Kiong
title Positive-unlabeled learning for disease gene identification
publishDate 2013
url https://hdl.handle.net/10356/96132
http://hdl.handle.net/10220/10776