Robust feature selection for high-dimensional and small-sized gene expression data

One important issue in constructing a pattern recognition system is feature selection. The goal of feature selection, including feature ranking and feature subset selection, is to identify target-relevant features. When applied to high-dimensional and small-sized (HDSS) data, e.g. microarray gene ex...

Bibliographic Details
Main Author:	Yang, Feng
Other Authors:	Mao Kezhi
Format:	Theses and Dissertations
Language:	English
Published:	2012
Subjects:	DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems
Online Access:	https://hdl.handle.net/10356/48641
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-48641
record_format	dspace
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems
description	Feature selection is an important issue in constructing a pattern recognition system. Its goal, whether through feature ranking or feature subset selection, is to identify target-relevant features. When applied to high-dimensional and small-sized (HDSS) data, e.g. microarray gene expression data, commonly used feature selection algorithms become over-sensitive to variations in the training data, i.e. they lack robustness. The aim of this thesis is to address the robustness issue of feature selection for HDSS data.
Firstly, a novel criteria normalization algorithm is proposed for multi-criterion combination to improve the robustness of feature ranking. When applied to HDSS data, traditional feature ranking criteria tend to produce inconsistent rankings even under slight perturbations of the training samples. A widely used strategy for addressing this inconsistency is multi-criterion combination, but a crucial problem there is how to normalize feature scores from different criteria. The thesis proposes a new feature importance transformation algorithm, based on resampling and permutation, for score normalization. Experimental studies on four popular gene expression data sets show that multi-criterion combination based on the proposed score normalization produces gene rankings with improved robustness.
Secondly, a multi-criterion fusion-based recursive feature elimination (MCF-RFE) algorithm is developed with the goal of improving both the classification performance and the robustness of feature subset selection. Feature subset selection often aims to select a compact subset of features so as to build a pattern classifier with reduced complexity. From the perspective of pattern analysis, producing a stable or robust solution is also a desirable property of a feature subset selection algorithm. The thesis analyzes the robustness issue in feature subset selection for HDSS gene expression data and proposes the MCF-RFE algorithm. Experimental studies on five gene expression data sets show that MCF-RFE outperforms the commonly used benchmark feature selection algorithm SVM-RFE.
Thirdly, a new regularized linear discriminant analysis (LDA) based algorithm is proposed for robust feature selection on HDSS data. Gene expression data usually have high dimensionality, small sample size and class imbalance (i.e. a great discrepancy in the number of samples between classes). On such data, LDA-based feature selection encounters problems such as singularity of the scatter matrix, overfitting, overwhelming of the minority class by the majority class, and prohibitive computational complexity. The thesis proposes a new regularization technique that gives more emphasis to the minority class, with the expectation of improving overall performance by alleviating both the overwhelming of the minority class by the majority class and overfitting in the minority class. In addition, an incremental implementation of LDA-based feature selection is developed to reduce computational overhead. Comparative studies on five gene microarray problems show that LDA with the new regularization can produce gene subsets with excellent performance in both classification and robustness.
Fourthly, a framework for reducing feature redundancy is proposed. To improve the compactness of feature subsets, the usual practice is to remove as much redundancy as possible. But this strategy does not necessarily help in some problems, such as HDSS gene expression data analysis. We argue that a moderate degree of feature redundancy should be retained in the feature subset to improve both classification and robustness. The proposed framework, based on this argument, is flexible and can be applied to any feature ranking algorithm. We have implemented the framework with the popular feature ranking algorithm, Fisher's ratio, and conducted experiments on five gene expression data sets. The experimental studies show that the proposed framework produces feature subsets with improved classification performance and good robustness.
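The abstract names Fisher's ratio as a feature ranking criterion and describes robustness as insensitivity of the ranking to variations in the training data. The sketch below is not the thesis's algorithm. It only illustrates those two ideas under stated assumptions: the common two-class form of Fisher's ratio, synthetic data in place of real gene expression data, and mean pairwise Jaccard overlap of top-k feature sets across random subsamples as one possible robustness measure. The names `fisher_ratio`, `top_k` and `ranking_stability` are hypothetical.

```python
import numpy as np

def fisher_ratio(X, y):
    """Two-class Fisher's ratio per feature: (mu0 - mu1)^2 / (var0 + var1)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0, ddof=1) + X1.var(axis=0, ddof=1)
    return num / (den + 1e-12)  # small constant guards against zero variance

def top_k(scores, k):
    """Indices of the k highest-scoring features, as a set."""
    return set(np.argsort(scores)[::-1][:k].tolist())

def ranking_stability(X, y, k=20, n_rounds=30, frac=0.8, seed=0):
    """Mean pairwise Jaccard overlap of top-k sets over random subsamples.

    Assumed robustness measure for illustration; the thesis may use another.
    """
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(n_rounds):
        # Subsample each class separately so both classes stay represented.
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c),
                       size=max(2, int(frac * np.sum(y == c))),
                       replace=False)
            for c in (0, 1)
        ])
        subsets.append(top_k(fisher_ratio(X[idx], y[idx]), k))
    overlaps = [len(a & b) / len(a | b)
                for i, a in enumerate(subsets) for b in subsets[i + 1:]]
    return float(np.mean(overlaps))

# Synthetic HDSS-style data: 40 samples, 500 features (an assumption,
# standing in for a real microarray data set).
rng = np.random.default_rng(0)
n_per_class, n_features = 20, 500
X = rng.normal(size=(2 * n_per_class, n_features))
y = np.repeat([0, 1], n_per_class)
X[y == 1, :5] += 4.0  # first five features are informative by construction

scores = fisher_ratio(X, y)
stability = ranking_stability(X, y, k=20)
print("top 5 features:", sorted(top_k(scores, 5)))
print("top-20 stability:", round(stability, 3))
```

A stability value near 1 means the top-k set barely changes between subsamples, while a value near 0 means the ranking is highly sensitive to which samples are drawn.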
author2	Mao Kezhi
author_facet	Mao Kezhi Yang, Feng
format	Theses and Dissertations
author	Yang, Feng
author_sort	Yang, Feng
title	Robust feature selection for high-dimensional and small-sized gene expression data
publishDate	2012
url	https://hdl.handle.net/10356/48641
_version_	1772825500593422336
spelling	sg-ntu-dr.10356-48641 2023-07-04T16:46:18Z Robust feature selection for high-dimensional and small-sized gene expression data Yang, Feng Mao Kezhi School of Electrical and Electronic Engineering DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems DOCTOR OF PHILOSOPHY (EEE) 2012-05-04T07:59:46Z 2012-05-04T07:59:46Z 2011 2011 Thesis Yang, F. (2011). Robust feature selection for high-dimensional and small-sized gene expression data. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/48641 10.32657/10356/48641 en 238 p. application/pdf

Similar Items