Sparse modeling-based sequential ensemble learning for effective outlier detection in high-dimensional numeric data

The large proportion of irrelevant or noisy features in real-life high-dimensional data presents a significant challenge to subspace/feature selection-based high-dimensional outlier detection (a.k.a. outlier scoring) methods. These methods often perform the two dependent tasks: relevant feature subse...

Bibliographic Details
Main Authors: PANG, Guansong; CAO, Longbing; CHEN, Ling; LIAN, Defu; LIU, Huan
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2018
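Subjects: Outlier Detection; Outlier Ensemble; Feature Selection; Sparse Modeling; Sequential Ensemble; Databases and Information Systems; Data Storage Systems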
Online Access: https://ink.library.smu.edu.sg/sis_research/7140
https://ink.library.smu.edu.sg/context/sis_research/article/8143/viewcontent/11692_Article_Text_15220_1_2_20201228.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-8143
record_format dspace
spelling sg-smu-ink.sis_research-8143 2022-04-22T04:22:09Z Sparse modeling-based sequential ensemble learning for effective outlier detection in high-dimensional numeric data PANG, Guansong CAO, Longbing CHEN, Ling LIAN, Defu LIU, Huan The large proportion of irrelevant or noisy features in real-life high-dimensional data presents a significant challenge to subspace/feature selection-based high-dimensional outlier detection (a.k.a. outlier scoring) methods. These methods often perform the two dependent tasks: relevant feature subset search and outlier scoring independently, consequently retaining features/subspaces irrelevant to the scoring method and downgrading the detection performance. This paper introduces a novel sequential ensemble-based framework SEMSE and its instance CINFO to address this issue. SEMSE learns the sequential ensembles to mutually refine feature selection and outlier scoring by iterative sparse modeling with outlier scores as the pseudo target feature. CINFO instantiates SEMSE by using three successive recurrent components to build such sequential ensembles. Given outlier scores output by an existing outlier scoring method on a feature subset, CINFO first defines a Cantelli’s inequality-based outlier thresholding function to select outlier candidates with a false positive upper bound. It then performs lasso-based sparse regression by treating the outlier scores as the target feature and the original features as predictors on the outlier candidate set to obtain a feature subset that is tailored for the outlier scoring method. Our experiments show that two different outlier scoring methods enabled by CINFO (i) perform significantly better on 11 real-life high-dimensional data sets, and (ii) have much better resilience to noisy features, compared to their bare versions and three state-of-the-art competitors. The source code of CINFO is available at https://sites.google.com/site/gspangsite/sourcecode.
2018-02-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/7140 https://ink.library.smu.edu.sg/context/sis_research/article/8143/viewcontent/11692_Article_Text_15220_1_2_20201228.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Outlier Detection Outlier Ensemble Feature Selection Sparse Modeling Sequential Ensemble Databases and Information Systems Data Storage Systems
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Outlier Detection
Outlier Ensemble
Feature Selection
Sparse Modeling
Sequential Ensemble
Databases and Information Systems
Data Storage Systems
description The large proportion of irrelevant or noisy features in real-life high-dimensional data presents a significant challenge to subspace/feature selection-based high-dimensional outlier detection (a.k.a. outlier scoring) methods. These methods often perform the two dependent tasks: relevant feature subset search and outlier scoring independently, consequently retaining features/subspaces irrelevant to the scoring method and downgrading the detection performance. This paper introduces a novel sequential ensemble-based framework SEMSE and its instance CINFO to address this issue. SEMSE learns the sequential ensembles to mutually refine feature selection and outlier scoring by iterative sparse modeling with outlier scores as the pseudo target feature. CINFO instantiates SEMSE by using three successive recurrent components to build such sequential ensembles. Given outlier scores output by an existing outlier scoring method on a feature subset, CINFO first defines a Cantelli’s inequality-based outlier thresholding function to select outlier candidates with a false positive upper bound. It then performs lasso-based sparse regression by treating the outlier scores as the target feature and the original features as predictors on the outlier candidate set to obtain a feature subset that is tailored for the outlier scoring method. Our experiments show that two different outlier scoring methods enabled by CINFO (i) perform significantly better on 11 real-life high-dimensional data sets, and (ii) have much better resilience to noisy features, compared to their bare versions and three state-of-the-art competitors. The source code of CINFO is available at https://sites.google.com/site/gspangsite/sourcecode.
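note The two steps named in the description (Cantelli-based thresholding, then lasso regression of the outlier scores on the original features) can be illustrated with a minimal sketch. This is not the authors' released code. It assumes the threshold has the form mean + a * std, which by Cantelli's inequality, P(X - mu >= a*sigma) <= 1/(1 + a^2), bounds the tail mass at 1/(1 + a^2). The function names, the parameters `a` and `lam`, the coordinate-descent solver, and the toy data are all hypothetical choices made for illustration, and only a single refinement round is shown.

```python
import statistics


def cantelli_candidates(scores, a=1.0):
    """Return indices with score >= mean + a*std, plus the Cantelli bound."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)
    bound = 1.0 / (1.0 + a * a)  # P(X - mu >= a*sigma) <= 1 / (1 + a^2)
    idx = [i for i, s in enumerate(scores) if s >= mu + a * sigma]
    return idx, bound


def soft_threshold(z, lam):
    """Shrink z toward zero by lam (the proximal operator of the L1 penalty)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent on centered data.

    Minimizes (1 / 2n) * ||y - Xw||^2 + lam * ||w||_1 (intercept absorbed
    by centering). Assumed solver; the source only says "lasso-based".
    """
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    cols = [[row[j] - means[j] for row in X] for j in range(d)]
    y_mean = sum(y) / n
    residual = [v - y_mean for v in y]  # equals y - Xw while w is all zeros
    w = [0.0] * d
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            norm = sum(c * c for c in col) / n
            if norm == 0.0:  # constant feature carries no information
                continue
            # Correlation of feature j with the residual that excludes j.
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, lam) / norm
            if new_w != w[j]:
                delta = new_w - w[j]
                residual = [r - c * delta for c, r in zip(col, residual)]
                w[j] = new_w
    return w


def select_features(X, scores, a=1.0, lam=0.1):
    """One round: threshold the scores, then keep features with nonzero weight."""
    idx, _ = cantelli_candidates(scores, a)
    if not idx:
        return []
    w = lasso_cd([X[i] for i in idx], [scores[i] for i in idx], lam)
    return [j for j, wj in enumerate(w) if wj != 0.0]


# Toy data (made up): feature 0 tracks the scores, feature 1 is
# uncorrelated with them on the candidate set, feature 2 is constant.
X = [[0.5, 0.0, 5.0]] * 12 + [
    [1.0, 1.0, 5.0], [2.0, -1.0, 5.0], [3.0, -1.0, 5.0], [4.0, 1.0, 5.0],
]
scores = [0.0] * 12 + [7.0, 8.0, 9.0, 10.0]
print(select_features(X, scores))
```

On this toy input the four high-scoring rows become the outlier candidates, and the L1 penalty zeroes out the two uninformative features. In the framework described above, the retained subset would then be passed back to the outlier scoring method for the next round; that loop is omitted here.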
format text
author PANG, Guansong
CAO, Longbing
CHEN, Ling
LIAN, Defu
LIU, Huan
author_sort PANG, Guansong
title Sparse modeling-based sequential ensemble learning for effective outlier detection in high-dimensional numeric data
publisher Institutional Knowledge at Singapore Management University
publishDate 2018
url https://ink.library.smu.edu.sg/sis_research/7140
https://ink.library.smu.edu.sg/context/sis_research/article/8143/viewcontent/11692_Article_Text_15220_1_2_20201228.pdf
_version_ 1770576230596214784