Towards ultrahigh dimensional feature selection for big data

In this paper, we present a new adaptive feature scaling scheme for ultrahigh-dimensional feature selection on Big Data, and then reformulate it as a convex semi-infinite programming (SIP) problem. To address the SIP, we propose an efficient feature generating paradigm. Unlike traditional gradient-based approaches that conduct optimization on all input features, the proposed paradigm iteratively activates a group of features and solves a sequence of multiple kernel learning (MKL) subproblems. To further speed up training, we propose to solve the MKL subproblems in their primal forms through a modified accelerated proximal gradient approach. Building on this optimization scheme, several efficient caching techniques are also developed. The feature generating paradigm is guaranteed to converge globally under mild conditions and can achieve a lower feature selection bias. Moreover, the proposed method can tackle two challenging tasks in feature selection: 1) group-based feature selection with complex structures, and 2) nonlinear feature selection with explicit feature mappings. Comprehensive experiments on a wide range of synthetic and real-world data sets with tens of millions of data points and O(10^14) features demonstrate the competitive performance of the proposed method over state-of-the-art feature selection methods in terms of generalization performance and training efficiency.
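The "feature generating" idea described above can be sketched, in a highly simplified form, as an active-set loop: score the currently inactive features, activate the most informative group, and refit only the active weights with a proximal gradient method. The toy implementation below is illustrative only and is not the authors' algorithm: it uses a plain least-squares loss with an L1 penalty in place of the paper's MKL subproblems, and basic ISTA in place of the modified accelerated proximal gradient solver; all names and parameters are assumptions.

```python
import numpy as np

def feature_generating_selection(X, y, budget=5, n_rounds=4, n_inner=200, lam=0.1):
    """Toy active-set feature selection for least squares.

    Each round activates the `budget` inactive features with the largest
    gradient magnitude, then refits the active weights with proximal
    gradient descent (ISTA, i.e., gradient step + soft-thresholding).
    """
    n, d = X.shape
    w = np.zeros(d)
    active = np.zeros(d, dtype=bool)
    for _ in range(n_rounds):
        # score inactive features by the gradient of the squared loss
        grad = X.T @ (X @ w - y) / n
        scores = np.abs(grad) * ~active
        active[np.argsort(scores)[-budget:]] = True
        # inner solve restricted to the active set
        Xa = X[:, active]
        wa = w[active]
        # step size from the Lipschitz constant of the active subproblem
        step = 1.0 / (np.linalg.norm(Xa, 2) ** 2 / n + 1e-12)
        for _ in range(n_inner):
            g = Xa.T @ (Xa @ wa - y) / n
            wa = wa - step * g
            wa = np.sign(wa) * np.maximum(np.abs(wa) - step * lam, 0.0)
        w[active] = wa
    return w
```

Because each inner solve touches only the activated columns of X, the per-round cost scales with the size of the active set rather than the full dimensionality, which is the intuition behind applying such a scheme to ultrahigh-dimensional data.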


Bibliographic Details
Main Authors: Tan, Mingkui, Tsang, Ivor W., Wang, Li
Other Authors: School of Computer Engineering
Format: Article
Language:English
Published: 2014
Subjects: DRNTU::Engineering::Computer science and engineering::Data
Online Access:https://hdl.handle.net/10356/105805
http://hdl.handle.net/10220/20902
http://www.jmlr.org/papers/v15/tan14a.html
Institution: Nanyang Technological University
Published in: Journal of Machine Learning Research
Citation: Tan, M., Tsang, I. W., & Wang, L. (2014). Towards ultrahigh dimensional feature selection for big data. Journal of Machine Learning Research, 15, 1371-1429.
Description: 60 p.
Rights: © 2014 The Authors (Journal of Machine Learning Research). This paper was published in the Journal of Machine Learning Research and is made available as an electronic reprint with the permission of the authors. The paper can be found at the following official URL: http://jmlr.org/papers/volume15/tan14a/tan14a.pdf. One print or electronic copy may be made for personal use only. Systematic or multiple reproduction, distribution to multiple locations via electronic or other means, duplication of any material in this paper for a fee or for commercial purposes, or modification of the content of the paper is prohibited and is subject to penalties under law.