Robust feature selection for high-dimensional and small-sized gene expression data

One important issue in constructing a pattern recognition system is feature selection. The goal of feature selection, including feature ranking and feature subset selection, is to identify target-relevant features. When applied to high-dimensional and small-sized (HDSS) data, e.g. microarray gene ex...

Bibliographic Details
Main Author: Yang, Feng
Other Authors: Mao Kezhi
Format: Theses and Dissertations
Language: English
Published: 2012
Subjects:
Online Access: https://hdl.handle.net/10356/48641
Institution: Nanyang Technological University
id sg-ntu-dr.10356-48641
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems
spellingShingle DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems
Yang, Feng
Robust feature selection for high-dimensional and small-sized gene expression data
description One important issue in constructing a pattern recognition system is feature selection. The goal of feature selection, which includes feature ranking and feature subset selection, is to identify target-relevant features. When applied to high-dimensional and small-sized (HDSS) data, e.g. microarray gene expression data, commonly used feature selection algorithms are over-sensitive to variations in the training data, i.e. they lack robustness. The aim of this thesis is to address the robustness issue of feature selection for HDSS data.
Firstly, a novel criteria normalization algorithm is proposed for multi-criterion combination to improve the robustness of feature ranking. When applied to HDSS data, traditional feature ranking criteria tend to produce inconsistent rankings even under slight perturbations of the training samples. A widely used strategy for reducing this inconsistency is multi-criterion combination, but a crucial problem in multi-criterion combination is how to normalize the feature scores produced by different criteria. In the thesis, a new feature importance transformation algorithm based on resampling and permutation is proposed for score normalization. Experimental studies on four popular gene expression data sets show that multi-criterion combination based on the proposed score normalization produces gene rankings with improved robustness.
Secondly, a multi-criterion fusion-based recursive feature elimination (MCF-RFE) algorithm is developed with the goal of improving both the classification performance and the robustness of feature subset selection. Feature subset selection often aims to select a compact subset of features in order to build a pattern classifier with reduced complexity. From the perspective of pattern analysis, producing a stable, robust solution is also a desirable property of a feature subset selection algorithm. In the thesis, we analyze the robustness issue in feature subset selection for HDSS gene expression data and propose the MCF-RFE algorithm. Experimental studies on five gene expression data sets show that the MCF-RFE algorithm outperforms the commonly used benchmark feature selection algorithm SVM-RFE.
Thirdly, a new regularized linear discriminant analysis (LDA) based algorithm is proposed for robust feature selection on HDSS data. Gene expression data usually combine high dimensionality, small sample size and class imbalance (i.e. a large discrepancy in the number of samples between classes). On such data, LDA-based feature selection encounters problems such as singularity of the scatter matrix, overfitting, the majority class overwhelming the minority class, and prohibitive computational complexity. In the thesis, we propose a new regularization technique that gives more emphasis to the minority class, with the expectation of improving overall performance by reducing both the dominance of the majority class over the minority class and overfitting in the minority class. In addition, an incremental implementation of LDA-based feature selection is developed to reduce computational overhead. Comparative studies on five gene microarray problems show that LDA with the new regularization can produce gene subsets with excellent performance in both classification and robustness.
Fourthly, a framework for reducing feature redundancy is proposed. To improve the compactness of feature subsets, the usual practice is to remove as much redundancy as possible. But this strategy does not necessarily help in some problems, such as HDSS gene expression data analysis. We argue that a moderate degree of feature redundancy should be retained in the feature subset to improve both classification and robustness. The proposed framework based on this argument is flexible and can be applied to any feature ranking algorithm. We have implemented the framework with a popular feature ranking algorithm, Fisher's ratio, and conducted experiments on five gene expression data sets. The experimental studies show that the proposed framework produces feature subsets with improved classification performance and good robustness.
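As an illustration only, the two ideas the abstract relies on, ranking features by Fisher's ratio and checking how sensitive a ranking is to variations in the training data, can be sketched in a few lines of standard-library Python. This is not the thesis's framework or any of its algorithms. The two-class Fisher's ratio used here is the textbook form (squared difference of class means divided by the sum of class variances), and the stability score (mean pairwise Jaccard overlap of the top-k features across random stratified subsamples) is an assumed, commonly used measure rather than the one reported in the thesis. All function names and parameters are hypothetical, and class labels are assumed to be 0 and 1 with at least two samples per class.

```python
import random
from itertools import combinations
from statistics import mean, pvariance


def fisher_ratio(values_a, values_b):
    """Textbook two-class Fisher's ratio for a single feature."""
    mean_gap = (mean(values_a) - mean(values_b)) ** 2
    spread = pvariance(values_a) + pvariance(values_b)
    if spread == 0:
        # Assumption: zero within-class variance means perfectly separated
        # (infinite score) if the means differ, uninformative otherwise.
        return float("inf") if mean_gap > 0 else 0.0
    return mean_gap / spread


def rank_features(samples, labels):
    """Return feature indices ordered from highest to lowest Fisher's ratio."""
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        class0 = [row[j] for row, y in zip(samples, labels) if y == 0]
        class1 = [row[j] for row, y in zip(samples, labels) if y == 1]
        scores.append(fisher_ratio(class0, class1))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def top_k_stability(samples, labels, k, n_rounds=20, fraction=0.8, seed=0):
    """Mean pairwise Jaccard overlap of the top-k feature sets obtained
    from n_rounds random stratified subsamples (1.0 = identical sets)."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for i, y in enumerate(labels):
        by_class[y].append(i)
    top_sets = []
    for _ in range(n_rounds):
        chosen = []
        for members in by_class.values():
            # Keep at least two samples per class in every subsample.
            size = max(2, int(fraction * len(members)))
            chosen.extend(rng.sample(members, size))
        sub_x = [samples[i] for i in chosen]
        sub_y = [labels[i] for i in chosen]
        top_sets.append(set(rank_features(sub_x, sub_y)[:k]))
    overlaps = [len(a & b) / len(a | b) for a, b in combinations(top_sets, 2)]
    return mean(overlaps)
```

Calling `rank_features(samples, labels)` on a list of expression profiles gives an ordering of gene indices, and `top_k_stability(samples, labels, k=50)` gives a number between 0 and 1. On HDSS data a low value would indicate the kind of over-sensitivity to training samples that the abstract describes.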
author2 Mao Kezhi
author_facet Mao Kezhi
Yang, Feng
format Theses and Dissertations
author Yang, Feng
author_sort Yang, Feng
title Robust feature selection for high-dimensional and small-sized gene expression data
title_short Robust feature selection for high-dimensional and small-sized gene expression data
title_full Robust feature selection for high-dimensional and small-sized gene expression data
title_fullStr Robust feature selection for high-dimensional and small-sized gene expression data
title_full_unstemmed Robust feature selection for high-dimensional and small-sized gene expression data
title_sort robust feature selection for high-dimensional and small-sized gene expression data
publishDate 2012
url https://hdl.handle.net/10356/48641
_version_ 1772825500593422336
spelling sg-ntu-dr.10356-48641 2023-07-04T16:46:18Z Robust feature selection for high-dimensional and small-sized gene expression data Yang, Feng Mao Kezhi School of Electrical and Electronic Engineering DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems DOCTOR OF PHILOSOPHY (EEE) 2012-05-04T07:59:46Z 2012-05-04T07:59:46Z 2011 2011 Thesis Yang, F. (2011). Robust feature selection for high-dimensional and small-sized gene expression data. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/48641 10.32657/10356/48641 en 238 p. application/pdf