Robust feature selection for high-dimensional and small-sized gene expression data
Main Author: | |
---|---|
Other Authors: | |
Format: | Theses and Dissertations |
Language: | English |
Published: | 2012 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/48641 |
Institution: | Nanyang Technological University |
Summary: | Feature selection is an important issue in constructing a pattern recognition system. The goal of feature selection, which includes feature ranking and feature subset selection, is to identify target-relevant features. When applied to high-dimensional and small-sized (HDSS) data, e.g. microarray gene expression data, commonly used feature selection algorithms become over-sensitive to variations in the training data, i.e. they lack robustness. The aim of this thesis is to address the robustness of feature selection for HDSS data. Firstly, a novel criteria normalization algorithm is proposed for multi-criterion combination to improve the robustness of feature ranking. On HDSS data, traditional feature ranking criteria tend to produce inconsistent rankings even under slight perturbations of the training samples. A widely used strategy for resolving this inconsistency is multi-criterion combination, but one crucial problem in multi-criterion combination is how to normalize feature scores from different criteria. The thesis proposes a new feature importance transformation algorithm, based on resampling and permutation, for score normalization. Experimental studies on four popular gene expression data sets show that multi-criterion combination based on the proposed score normalization produces gene rankings with improved robustness. Secondly, a multi-criterion fusion-based recursive feature elimination (MCF-RFE) algorithm is developed with the goal of improving both the classification performance and the robustness of feature subset selection. Feature subset selection often aims to select a compact subset of features in order to build a pattern classifier with reduced complexity. From the perspective of pattern analysis, producing a stable, or robust, solution is also a desirable property of a feature subset selection algorithm. 
In the thesis, we analyze the robustness issue in feature subset selection for HDSS gene expression data and propose the MCF-RFE algorithm. Experimental studies on five gene expression data sets show that the MCF-RFE algorithm outperforms the commonly used benchmark feature selection algorithm SVM-RFE. Thirdly, a new regularized linear discriminant analysis (LDA) based algorithm is proposed for robust feature selection on HDSS data. Gene expression data usually have high dimensionality, small sample size and class imbalance (i.e. a large discrepancy in the number of samples between classes). When applied to such data, LDA-based feature selection encounters problems such as singularity of the scatter matrix, overfitting, overwhelming of the minority class by the majority class, and prohibitive computational complexity. We propose a new regularization technique that gives more emphasis to the minority class, with the expectation of improving overall performance by alleviating both the overwhelming of the minority class by the majority class and overfitting in the minority class. In addition, an incremental implementation of LDA-based feature selection has been developed to reduce computational overhead. Comparative studies on five gene microarray problems show that LDA with the new regularization can produce gene subsets with excellent performance in both classification and robustness. Fourthly, a framework for reducing feature redundancy is proposed. To improve the compactness of feature subsets, the usual practice is to remove as much redundancy as possible. However, this strategy does not necessarily help in problems such as HDSS gene expression data analysis. We argue that a moderate degree of feature redundancy should be retained in the feature subset to improve both classification performance and robustness. The proposed framework, based on this argument, is flexible and can be applied to any feature ranking algorithm. 
We have implemented the framework with the popular feature ranking algorithm Fisher's ratio and conducted experiments on five gene expression data sets. The experimental studies show that the proposed framework produces feature subsets with improved classification performance and good robustness. |
---|
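
The summary describes robustness as low sensitivity of the selected features to variations in the training data. As a rough illustration of that notion only, and not the thesis's own method, the sketch below ranks features by a two-class Fisher's ratio on repeated subsamples and reports the average Jaccard overlap of the top-k sets. The function names, the Jaccard measure, the subsampling scheme, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def fisher_ratio(X, y):
    """Two-class Fisher's ratio per feature: (mu0 - mu1)^2 / (var0 + var1)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12  # guard against zero variance
    return num / den

def top_k(scores, k):
    """Indices of the k highest-scoring features, as a set."""
    return set(np.argsort(scores)[::-1][:k].tolist())

def ranking_stability(X, y, k=10, n_rounds=20, frac=0.8, seed=0):
    """Mean pairwise Jaccard overlap of top-k sets over stratified subsamples.

    Hypothetical robustness measure: 1.0 means identical top-k sets in
    every round, values near 0 mean the selection changes almost entirely.
    """
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(n_rounds):
        # Draw a fraction of each class without replacement.
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c),
                       size=int(frac * np.sum(y == c)),
                       replace=False)
            for c in (0, 1)
        ])
        subsets.append(top_k(fisher_ratio(X[idx], y[idx]), k))
    overlaps = [len(a & b) / len(a | b)
                for i, a in enumerate(subsets)
                for b in subsets[i + 1:]]
    return float(np.mean(overlaps))

# Synthetic HDSS-like data (assumed): 40 samples, 500 features,
# of which only the first 5 differ between the two classes.
demo_rng = np.random.default_rng(1)
X = demo_rng.normal(size=(40, 500))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :5] += 2.0

stability = ranking_stability(X, y, k=10)
print(f"mean top-10 Jaccard overlap: {stability:.2f}")
```

With many more features than samples, the top-k sets typically differ from one subsample to the next, which is the kind of instability the thesis sets out to reduce.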