Integrated framework with association analysis for gene selection in microarray data classification

Microarray data classification is one of the major interests in health informatics that aims at discovering hidden patterns in gene expression profiles. The main challenge in building this classification system is the curse of dimensionality problem. Therefore, gene selection is an indispensable tas...

Bibliographic Details
Main Author:	Ong, Huey Fang
Format:	Thesis
Language:	English
Published:	2011
Online Access:	http://psasir.upm.edu.my/id/eprint/27711/1/FSKTM%202011%2029R.pdf http://psasir.upm.edu.my/id/eprint/27711/
Institution:	Universiti Putra Malaysia

id	my.upm.eprints.27711
record_format	eprints
spelling	my.upm.eprints.27711 2014-04-10T04:22:58Z http://psasir.upm.edu.my/id/eprint/27711/ Integrated framework with association analysis for gene selection in microarray data classification Ong, Huey Fang 2011-04 Thesis NonPeerReviewed application/pdf en http://psasir.upm.edu.my/id/eprint/27711/1/FSKTM%202011%2029R.pdf Ong, Huey Fang (2011) Integrated framework with association analysis for gene selection in microarray data classification. Masters thesis, Universiti Putra Malaysia. English
institution	Universiti Putra Malaysia
building	UPM Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Putra Malaysia
content_source	UPM Institutional Repository
url_provider	http://psasir.upm.edu.my/
language	English
description	Microarray data classification is a major interest in health informatics; it aims to discover hidden patterns in gene expression profiles. The main challenge in building such a classification system is the curse of dimensionality. Gene selection is therefore an indispensable task in microarray data classification, used to identify smaller sets of relevant genes. However, most existing gene selection methods are statistical analyses based purely on gene expression values for identifying differentially expressed genes. As a result, the selected genes might be false positives and might not be biologically meaningful. The purpose of this study was to integrate microarray datasets with additional biological information in order to select genes that are not only differentially expressed but also informative for classifiers. To that end, an integrated framework with a new gene selection method was developed to improve classification performance in terms of accuracy and number of selected genes. The proposed gene selection method combined the strengths of a filter method and association analysis to identify a set of discriminative and informative genes. Association analysis was employed to integrate more than one type of biological information in the same transaction database, and to identify groups of genes that frequently co-occur in target samples. The existing association algorithm for mining frequent itemsets was modified so that genes in each itemset were sorted by their discriminative scores rather than in lexicographic order. In addition, discriminative scores were used to compute the interestingness of frequent itemsets before ranking them.
The proposed integrated framework was tested on colon cancer, leukemia, breast cancer and lung cancer microarray datasets. Two types of biological information were incorporated in the selection process, namely the Gene Ontology (GO) and KEGG Pathways (KEGG). The experimental results showed that the recommended GO-based, KEGG-based, and GO-KEGG-based models outperformed the expression-only models, attaining better classification accuracies with fewer genes. In the experiments, the leukemia and lung cancer datasets achieved 100% accuracy in all the classifiers with as few as three selected genes. The colon cancer and breast cancer datasets achieved better classification accuracies than the previous integrated method, at 95.16% and 95.88% respectively. Moreover, the proposed integrated framework was shown to build informative and interpretable microarray classification models. The selected genes can be traced back to their functional annotations and association groups for reasoning and for generating new hypotheses for future investigation.
format	Thesis
author	Ong, Huey Fang
title	Integrated framework with association analysis for gene selection in microarray data classification
publishDate	2011
url	http://psasir.upm.edu.my/id/eprint/27711/1/FSKTM%202011%2029R.pdf http://psasir.upm.edu.my/id/eprint/27711/
_version_	1643829257941549056
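
Illustrative sketch (not taken from the thesis): the description in this record mentions mining frequent itemsets in which genes are ordered by discriminative score rather than lexicographically, and ranking itemsets by an interestingness value computed from those scores. The Python below is one possible reading of that idea, under several assumptions. The record does not give the actual algorithm, the scoring function, or the layout of the transaction database, so the function names (mine_frequent_itemsets, rank_itemsets), the one-transaction-per-target-sample layout, and the interestingness formula (support multiplied by mean score) are all hypothetical.

```python
from itertools import combinations


def mine_frequent_itemsets(transactions, scores, min_support, max_size=3):
    """Apriori-style level-wise mining (hypothetical sketch).

    Items inside every itemset are kept in descending order of their
    discriminative score instead of lexicographic order.
    """
    tx_sets = [set(t) for t in transactions]
    n = len(tx_sets)

    # Global item order: highest score first, name as a deterministic tie-break.
    items = {g for t in tx_sets for g in t}
    order = sorted(items, key=lambda g: (-scores.get(g, 0.0), g))
    rank = {g: r for r, g in enumerate(order)}

    def support(itemset):
        wanted = set(itemset)
        return sum(1 for t in tx_sets if wanted <= t) / n

    frequent = {}
    current = [(g,) for g in order if support((g,)) >= min_support]
    size = 1
    while current and size <= max_size:
        for itemset in current:
            frequent[itemset] = support(itemset)

        # Join step: merge itemsets that share the same score-ordered prefix.
        candidates = set()
        for a, b in combinations(current, 2):
            if a[:-1] == b[:-1]:
                tail = tuple(sorted((a[-1], b[-1]), key=rank.get))
                merged = a[:-1] + tail
                # Prune step: every sub-itemset must already be frequent.
                if all(sub in frequent for sub in combinations(merged, size)):
                    candidates.add(merged)

        current = sorted(
            (c for c in candidates if support(c) >= min_support),
            key=lambda c: [rank[g] for g in c],
        )
        size += 1
    return frequent


def rank_itemsets(frequent, scores):
    """Rank itemsets by an assumed interestingness: support * mean score."""
    def interestingness(itemset, supp):
        mean_score = sum(scores.get(g, 0.0) for g in itemset) / len(itemset)
        return supp * mean_score

    scored = [(it, interestingness(it, s)) for it, s in frequent.items()]
    return sorted(scored, key=lambda pair: -pair[1])


if __name__ == "__main__":
    # Toy data: made-up gene names and scores, one transaction per sample.
    toy_scores = {"g1": 0.9, "g2": 0.7, "g3": 0.4, "g4": 0.1}
    toy_transactions = [
        {"g1", "g2", "g3"},
        {"g1", "g2"},
        {"g2", "g3", "g4"},
        {"g1", "g2", "g3"},
    ]
    found = mine_frequent_itemsets(toy_transactions, toy_scores, min_support=0.5)
    for itemset, value in rank_itemsets(found, toy_scores):
        print(itemset, round(value, 4))
```

Because the item order is fixed by score, the highest-scoring gene always leads an itemset, which may make the top-ranked groups easier to read. A different ordering or interestingness measure would slot into the same structure.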

Similar Items