Large-scale online feature selection for ultra-high dimensional sparse data

Feature selection (FS) is an important technique in machine learning and data mining, especially for large-scale high-dimensional data. Most existing studies have been restricted to batch learning, which is often inefficient and poorly scalable when handling big data in the real world. As real data may...


Bibliographic Details
Main Authors: WU, Yue, HOI, Steven C. H., MEI, Tao, YU, Nenghai
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2017
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/3781
https://ink.library.smu.edu.sg/context/sis_research/article/4783/viewcontent/Large_Scale_Online_Feature_Selection_Ultra_high_2017_afv.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-4783
record_format dspace
spelling sg-smu-ink.sis_research-47832021-03-26T07:28:13Z Large-scale online feature selection for ultra-high dimensional sparse data WU, Yue HOI, Steven C. H. MEI, Tao YU, Nenghai Feature selection (FS) is an important technique in machine learning and data mining, especially for large-scale high-dimensional data. Most existing studies have been restricted to batch learning, which is often inefficient and poorly scalable when handling big data in the real world. As real data may arrive sequentially and continuously, batch learning has to retrain the model on the newly arriving data, which is very computationally intensive. Online feature selection (OFS) is a promising new paradigm that is more efficient and scalable than batch learning algorithms. However, existing online algorithms usually fall short in efficacy. In this article, we present a novel second-order OFS algorithm that is simple yet effective, very fast, and extremely scalable for large-scale ultra-high dimensional sparse data streams. The basic idea is to exploit second-order information to choose the subset of important features with high confidence weights. Unlike existing OFS methods that often suffer from extremely high computational cost, we devise a novel MaxHeap-based algorithm that is not only more effective than existing first-order algorithms but also significantly more efficient and scalable. Our extensive experiments validate that the proposed technique achieves highly competitive accuracy compared with state-of-the-art batch FS methods while consuming orders of magnitude less computational cost. Impressively, on a billion-scale synthetic dataset (1-billion dimensions, 1-billion non-zero features, and 1-million samples), the proposed algorithm takes less than 3 minutes to run on a single PC.
2017-08-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/3781 info:doi/10.1145/3070646 https://ink.library.smu.edu.sg/context/sis_research/article/4783/viewcontent/Large_Scale_Online_Feature_Selection_Ultra_high_2017_afv.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Feature selection Second-order online learning Sparsity Ultra-high dimensionality Databases and Information Systems Numerical Analysis and Computation Theory and Algorithms
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Feature selection
Second-order online learning
Sparsity
Ultra-high dimensionality
Databases and Information Systems
Numerical Analysis and Computation
Theory and Algorithms
spellingShingle Feature selection
Second-order online learning
Sparsity
Ultra-high dimensionality
Databases and Information Systems
Numerical Analysis and Computation
Theory and Algorithms
WU, Yue
HOI, Steven C. H.
MEI, Tao
YU, Nenghai
Large-scale online feature selection for ultra-high dimensional sparse data
description Feature selection (FS) is an important technique in machine learning and data mining, especially for large-scale high-dimensional data. Most existing studies have been restricted to batch learning, which is often inefficient and poorly scalable when handling big data in the real world. As real data may arrive sequentially and continuously, batch learning has to retrain the model on the newly arriving data, which is very computationally intensive. Online feature selection (OFS) is a promising new paradigm that is more efficient and scalable than batch learning algorithms. However, existing online algorithms usually fall short in efficacy. In this article, we present a novel second-order OFS algorithm that is simple yet effective, very fast, and extremely scalable for large-scale ultra-high dimensional sparse data streams. The basic idea is to exploit second-order information to choose the subset of important features with high confidence weights. Unlike existing OFS methods that often suffer from extremely high computational cost, we devise a novel MaxHeap-based algorithm that is not only more effective than existing first-order algorithms but also significantly more efficient and scalable. Our extensive experiments validate that the proposed technique achieves highly competitive accuracy compared with state-of-the-art batch FS methods while consuming orders of magnitude less computational cost. Impressively, on a billion-scale synthetic dataset (1-billion dimensions, 1-billion non-zero features, and 1-million samples), the proposed algorithm takes less than 3 minutes to run on a single PC.
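The heap-based selection step sketched in the abstract, keeping only the B features with the largest confidence-scaled weight magnitudes, can be illustrated as follows. This is a hypothetical sketch, not the paper's actual algorithm: the function name `select_top_features`, the dict-based sparse representation, and the score |w_i|·c_i are all illustrative assumptions.

```python
import heapq

def select_top_features(weights, confidences, B):
    """Keep only the B features with the largest confidence-scaled
    magnitude |w_i| * c_i; drop the rest (sparse truncation).

    `weights` and `confidences` are dicts mapping feature index ->
    value, mirroring a sparse model where only non-zero entries are
    stored. Illustrative sketch only.
    """
    # Score each active (non-zero) feature.
    scored = ((abs(w) * confidences.get(i, 1.0), i)
              for i, w in weights.items() if w != 0.0)
    # heapq.nlargest scans the n active features with an internal
    # heap of size B, i.e. O(n log B) rather than a full sort --
    # the kind of saving a heap-based selection targets.
    keep = {i for _, i in heapq.nlargest(B, scored)}
    return {i: w for i, w in weights.items() if i in keep}

# Example: keep the 2 strongest of 4 active features.
w = {0: 0.9, 3: -0.1, 7: 0.5, 9: 0.05}
c = {0: 1.0, 3: 1.0, 7: 1.0, 9: 1.0}
print(sorted(select_top_features(w, c, 2)))  # -> [0, 7]
```

Because only the non-zero coordinates of each sparse example touch the heap, the cost per update stays proportional to the number of active features rather than the full (ultra-high) dimensionality.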
format text
author WU, Yue
HOI, Steven C. H.
MEI, Tao
YU, Nenghai
author_facet WU, Yue
HOI, Steven C. H.
MEI, Tao
YU, Nenghai
author_sort WU, Yue
title Large-scale online feature selection for ultra-high dimensional sparse data
title_short Large-scale online feature selection for ultra-high dimensional sparse data
title_full Large-scale online feature selection for ultra-high dimensional sparse data
title_fullStr Large-scale online feature selection for ultra-high dimensional sparse data
title_full_unstemmed Large-scale online feature selection for ultra-high dimensional sparse data
title_sort large-scale online feature selection for ultra-high dimensional sparse data
publisher Institutional Knowledge at Singapore Management University
publishDate 2017
url https://ink.library.smu.edu.sg/sis_research/3781
https://ink.library.smu.edu.sg/context/sis_research/article/4783/viewcontent/Large_Scale_Online_Feature_Selection_Ultra_high_2017_afv.pdf
_version_ 1770573731548102656