A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification

The common issues of high-dimensional gene expression data are that many of the genes may not be relevant, and there exists a high correlation among genes. Gene selection has been proven to be an effective way to improve the results of many classification methods. Sparse logistic regression using le...

Bibliographic Details
Main Authors: Algamal, Zakariya Yahya; Lee, Muhammad Hisyam
Format: Article
Published: Springer Verlag 2019
Subjects: QA Mathematics
Online Access: http://eprints.utm.my/id/eprint/96970/
http://dx.doi.org/10.1007/s11634-018-0334-1
Institution: Universiti Teknologi Malaysia
id my.utm.96970
record_format eprints
spelling my.utm.96970
2022-09-06T07:24:05Z
http://eprints.utm.my/id/eprint/96970/
A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification
Algamal, Zakariya Yahya
Lee, Muhammad Hisyam
QA Mathematics
The common issues of high-dimensional gene expression data are that many of the genes may not be relevant, and there exists a high correlation among genes. Gene selection has been proven to be an effective way to improve the results of many classification methods. Sparse logistic regression using least absolute shrinkage and selection operator (lasso) or using smoothly clipped absolute deviation is one of the most widely applicable methods in cancer classification for gene selection. However, this method faces a critical challenge in practical applications when there are high correlations among genes. To address this problem, a two-stage sparse logistic regression is proposed, with the aim of obtaining an efficient subset of genes with high classification capabilities by combining the screening approach as a filter method and adaptive lasso with a new weight as an embedded method. In the first stage, sure independence screening method as a screening approach retains those genes representing high individual correlation with the cancer class level. In the second stage, the adaptive lasso with new weight is implemented to address the existence of high correlations among the screened genes in the first stage. Experimental results based on four publicly available gene expression datasets have shown that the proposed method significantly outperforms three state-of-the-art methods in terms of classification accuracy, G-mean, area under the curve, and stability. In addition, the results demonstrate that the top selected genes are biologically related to the cancer type. Thus, the proposed method can be useful for cancer classification using DNA gene expression data in real clinical practice.
Springer Verlag
2019
Article
PeerReviewed
Algamal, Zakariya Yahya and Lee, Muhammad Hisyam (2019) A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification. Advances in Data Analysis and Classification, 13 (3). pp. 753-771. ISSN 1862-5347
http://dx.doi.org/10.1007/s11634-018-0334-1
DOI: 10.1007/s11634-018-0334-1
institution Universiti Teknologi Malaysia
building UTM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Malaysia
content_source UTM Institutional Repository
url_provider http://eprints.utm.my/
topic QA Mathematics
spellingShingle QA Mathematics
Algamal, Zakariya Yahya
Lee, Muhammad Hisyam
A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification
description The common issues of high-dimensional gene expression data are that many of the genes may not be relevant, and there exists a high correlation among genes. Gene selection has been proven to be an effective way to improve the results of many classification methods. Sparse logistic regression using least absolute shrinkage and selection operator (lasso) or using smoothly clipped absolute deviation is one of the most widely applicable methods in cancer classification for gene selection. However, this method faces a critical challenge in practical applications when there are high correlations among genes. To address this problem, a two-stage sparse logistic regression is proposed, with the aim of obtaining an efficient subset of genes with high classification capabilities by combining the screening approach as a filter method and adaptive lasso with a new weight as an embedded method. In the first stage, sure independence screening method as a screening approach retains those genes representing high individual correlation with the cancer class level. In the second stage, the adaptive lasso with new weight is implemented to address the existence of high correlations among the screened genes in the first stage. Experimental results based on four publicly available gene expression datasets have shown that the proposed method significantly outperforms three state-of-the-art methods in terms of classification accuracy, G-mean, area under the curve, and stability. In addition, the results demonstrate that the top selected genes are biologically related to the cancer type. Thus, the proposed method can be useful for cancer classification using DNA gene expression data in real clinical practice.
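The two stages named in the description can be illustrated with a minimal sketch. It is not the authors' implementation. Stage 1 ranks genes by absolute marginal correlation with the class label and keeps the top d, in the spirit of sure independence screening. Stage 2 fits an L1-penalized logistic regression with per-gene weights on the survivors. The record does not state the paper's "new weight", so the sketch substitutes a common default (inverse absolute ridge estimates) and labels it as an assumption. All function names, the toy data, and the values of d, lam, and ridge are hypothetical, and only the Python standard library is used so the example runs anywhere.

```python
import math
import random


def abs_correlation(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def sis_screen(X, y, d):
    """Stage 1: indices of the d genes most correlated (marginally) with the class."""
    p = len(X[0])
    scores = [abs_correlation([row[j] for row in X], y) for j in range(p)]
    return sorted(range(p), key=lambda j: scores[j], reverse=True)[:d]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_penalized_logistic(X, y, l1, l2=0.0, step=0.1, iters=500):
    """Proximal gradient descent for logistic regression.

    Minimizes mean log-loss + sum_j l1[j]*|beta_j| + (l2/2)*sum_j beta_j**2.
    The intercept is not penalized. Features are assumed to be roughly
    standardized so that the fixed step size is safe. Returns (intercept, beta).
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        resid = [sigmoid(b0 + sum(b * v for b, v in zip(beta, row))) - t
                 for row, t in zip(X, y)]
        b0 -= step * sum(resid) / n
        for j in range(p):
            grad = sum(r * row[j] for r, row in zip(resid, X)) / n + l2 * beta[j]
            z = beta[j] - step * grad
            # Soft-thresholding is the proximal step for the weighted L1 term.
            beta[j] = math.copysign(max(abs(z) - step * l1[j], 0.0), z)
    return b0, beta


def two_stage_select(X, y, d, lam, ridge=0.1):
    """SIS screening followed by an adaptive-lasso logistic fit on the survivors."""
    screened = sis_screen(X, y, d)
    Xs = [[row[j] for j in screened] for row in X]
    # ASSUMPTION: weights are inverse absolute ridge estimates, a common default.
    # The paper's "new weight" is not given in this record.
    _, init = fit_penalized_logistic(Xs, y, [0.0] * d, l2=ridge)
    weights = [1.0 / (abs(b) + 1e-6) for b in init]
    _, beta = fit_penalized_logistic(Xs, y, [lam * w for w in weights])
    selected = [j for j, b in zip(screened, beta) if b != 0.0]
    return screened, selected


def make_toy_data(n=80, p=50, seed=0):
    """Synthetic data: genes 0-2 are highly correlated and drive the class label."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        row = [z + 0.3 * rng.gauss(0, 1) for _ in range(3)]
        row += [rng.gauss(0, 1) for _ in range(p - 3)]
        X.append(row)
        y.append(1 if z + 0.5 * rng.gauss(0, 1) > 0 else 0)
    return X, y


X, y = make_toy_data()
screened, selected = two_stage_select(X, y, d=10, lam=0.05)
print("screened genes:", sorted(screened))
print("selected genes:", sorted(selected))
```

On real microarray data one would more likely standardize the expression matrix, choose d and lam by cross-validation, and use an optimized solver. The pure-Python loops here are only meant to make each step visible.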
format Article
author Algamal, Zakariya Yahya
Lee, Muhammad Hisyam
author_facet Algamal, Zakariya Yahya
Lee, Muhammad Hisyam
author_sort Algamal, Zakariya Yahya
title A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification
title_sort two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification
publisher Springer Verlag
publishDate 2019
url http://eprints.utm.my/id/eprint/96970/
http://dx.doi.org/10.1007/s11634-018-0334-1