Integrated framework with association analysis for gene selection in microarray data classification
Main Author: | |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2011 |
Online Access: | http://psasir.upm.edu.my/id/eprint/27711/1/FSKTM%202011%2029R.pdf http://psasir.upm.edu.my/id/eprint/27711/ |
Institution: | Universiti Putra Malaysia |
Summary: | Microarray data classification is a major interest in health informatics that aims to discover hidden patterns in gene expression profiles. The main challenge in building such a classification system is the curse of dimensionality. Gene selection is therefore an indispensable task in microarray data classification, used to identify smaller sets of relevant genes. However, most existing gene selection methods are statistical analyses that rely purely on gene expression values to identify differentially expressed genes. As a result, the selected genes might be false positives that are not biologically meaningful. The purpose of this study was to integrate microarray datasets with additional biological information in order to select genes that are not only differentially expressed but also informative for classifiers. To achieve this, an integrated framework with a new gene selection method was developed to improve classification performance in terms of accuracy and number of selected genes. The proposed gene selection method combined the strengths of the filter method and association analysis to identify a set of discriminative and informative genes. Association analysis was employed to integrate more than one type of biological information in the same transaction database, and to identify groups of genes that frequently co-occur in target samples. The existing association algorithm for mining frequent itemsets was modified so that genes in each itemset were sorted by their discriminative scores rather than in lexicographic order. In addition, discriminative scores were used to compute the interestingness of frequent itemsets before ranking them. The proposed integrated framework was tested on colon cancer, leukemia, breast cancer and lung cancer microarray datasets.
Two types of biological information were incorporated in the selection process, namely the Gene Ontology (GO) and KEGG Pathways (KEGG). The experimental results showed that the recommended GO-based, KEGG-based, and GO-KEGG-based models outperformed the expression-only models, attaining better classification accuracies with fewer genes. In the experiments, the leukemia and lung cancer datasets achieved 100% accuracy with all classifiers, using as few as three selected genes. The colon cancer and breast cancer datasets achieved classification accuracies of 95.16% and 95.88% respectively, which are better than those of the previous integrated method. Moreover, the proposed integrated framework was shown to build informative and interpretable microarray classification models. The selected genes can be traced back to their functional annotations and association groups for reasoning and for creating new hypotheses for future investigation. |
---|
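
The summary above mentions frequent itemsets whose genes are ordered by discriminative score and then ranked by an interestingness measure derived from those scores. The Python sketch below only illustrates that idea on toy data and is not the thesis's algorithm. The function name `mine_ranked_itemsets`, the gene names, the scores, the brute-force enumeration, and the interestingness formula (support multiplied by the mean score) are all assumptions made for illustration. The encoding of GO and KEGG annotations in the transaction database is left out, since this record does not describe it.

```python
from itertools import combinations


def mine_ranked_itemsets(transactions, scores, min_support=0.5, max_size=3):
    """Enumerate frequent gene sets and rank them by a score-weighted measure.

    transactions: list of sets of gene names (one set per target sample).
    scores: dict mapping each gene name to a discriminative score
            (assumed to come from some filter method).
    """
    n = len(transactions)
    # Order genes by descending discriminative score (ties broken by name)
    # instead of the usual lexicographic order.
    genes = sorted({g for t in transactions for g in t},
                   key=lambda g: (-scores[g], g))
    results = []
    # Brute-force enumeration for clarity; a real miner would prune candidates.
    for size in range(1, max_size + 1):
        # combinations() preserves input order, so each itemset tuple
        # stays sorted by descending score.
        for itemset in combinations(genes, size):
            support = sum(1 for t in transactions if set(itemset) <= t) / n
            if support >= min_support:
                # Hypothetical interestingness: support times mean score.
                interest = support * sum(scores[g] for g in itemset) / size
                results.append((itemset, support, interest))
    # Most interesting itemsets first.
    results.sort(key=lambda r: -r[2])
    return results


# Toy data (hypothetical): four samples and four genes.
scores = {"G1": 0.9, "G2": 0.7, "G3": 0.4, "G4": 0.2}
transactions = [
    {"G1", "G2", "G3"},
    {"G1", "G2"},
    {"G1", "G3", "G4"},
    {"G2", "G4"},
]
ranked = mine_ranked_itemsets(transactions, scores, min_support=0.5)
for itemset, support, interest in ranked:
    print(itemset, support, round(interest, 3))
```

Sorting genes by score before enumeration means the leading gene of every itemset is its most discriminative member, which is one plausible reason to prefer that ordering over lexicographic order.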