Homophily outlier detection in non-IID categorical data

Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are Independent and Identically Distributed (IID). This assumption does not hold in real-world applications where the outlierness o...

Bibliographic Details
Main Authors:	PANG, Guansong; CAO, Longbing; CHEN, Ling
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2021
Subjects:	Outlier Detection; Feature Selection; Non-IID Learning; Categorical Data; Homophily Relation; Random Walk; Coupling Learning; Artificial Intelligence and Robotics; Databases and Information Systems
Online Access:	https://ink.library.smu.edu.sg/sis_research/7017 https://ink.library.smu.edu.sg/context/sis_research/article/8020/viewcontent/2103.11516.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-8020
record_format	dspace
spelling	sg-smu-ink.sis_research-8020 2022-03-17T15:06:45Z Homophily outlier detection in non-IID categorical data PANG, Guansong CAO, Longbing CHEN, Ling
2021-04-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/7017 https://ink.library.smu.edu.sg/context/sis_research/article/8020/viewcontent/2103.11516.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Outlier Detection Feature Selection Non-IID Learning Categorical Data Homophily Relation Random Walk Coupling Learning Artificial Intelligence and Robotics Databases and Information Systems
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Outlier Detection Feature Selection Non-IID Learning Categorical Data Homophily Relation Random Walk Coupling Learning Artificial Intelligence and Robotics Databases and Information Systems
spellingShingle	Outlier Detection Feature Selection Non-IID Learning Categorical Data Homophily Relation Random Walk Coupling Learning Artificial Intelligence and Robotics Databases and Information Systems PANG, Guansong CAO, Longbing CHEN, Ling Homophily outlier detection in non-IID categorical data
description	Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are Independent and Identically Distributed (IID). This assumption does not hold in real-world applications where the outlierness of different entities is dependent on each other and/or taken from different probability distributions (non-IID). This may lead to a failure to detect important outliers that are too subtle to be identified without considering the non-IID nature. The issue is further intensified in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and its two instances to identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines and incorporates distribution-sensitive outlier factors and their interdependence into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed here to capture the rich non-IID characteristics. Our empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection by two different existing detectors.
format	text
author	PANG, Guansong CAO, Longbing CHEN, Ling
author_facet	PANG, Guansong CAO, Longbing CHEN, Ling
author_sort	PANG, Guansong
title	Homophily outlier detection in non-IID categorical data
title_short	Homophily outlier detection in non-IID categorical data
title_full	Homophily outlier detection in non-IID categorical data
title_fullStr	Homophily outlier detection in non-IID categorical data
title_full_unstemmed	Homophily outlier detection in non-IID categorical data
title_sort	homophily outlier detection in non-iid categorical data
publisher	Institutional Knowledge at Singapore Management University
publishDate	2021
url	https://ink.library.smu.edu.sg/sis_research/7017 https://ink.library.smu.edu.sg/context/sis_research/article/8020/viewcontent/2103.11516.pdf
_version_	1770576188587114496
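
The abstract describes building a value-value graph and propagating outlierness over it to score feature values. The following is a minimal Python sketch of that general idea only, not the authors' framework or either of its two instances. Everything specific in it is an assumption made for illustration: the function names `value_outlierness` and `object_scores`, the mode-based initial factor, the conditional-probability edge weights, the `damping` and `eps` parameters, and scoring an object by summing its value scores.

```python
from collections import Counter, defaultdict
from itertools import permutations


def value_outlierness(rows, damping=0.85, iters=200, eps=0.01):
    """Toy outlierness propagation on a value-value graph (illustrative only)."""
    n_features = len(rows[0])
    # Graph nodes are (feature index, value) pairs.
    counts = Counter((j, row[j]) for row in rows for j in range(n_features))
    nodes = list(counts)

    # Assumed initial factor: how far a value's frequency falls below the
    # frequency of the most common value (mode) of the same feature.
    mode = defaultdict(int)
    for (j, _), c in counts.items():
        mode[j] = max(mode[j], c)
    delta = {v: 1.0 - counts[v] / mode[v[0]] + eps for v in nodes}

    # Co-occurrence counts between values of different features.
    co = Counter()
    for row in rows:
        vals = [(j, row[j]) for j in range(n_features)]
        co.update(permutations(vals, 2))

    # Assumed edge weight u -> v: P(u | v) scaled by v's initial factor,
    # then row-normalised so each node's outgoing weights sum to 1.
    trans = {}
    for u in nodes:
        weights = {v: co[(u, v)] / counts[v] * delta[v]
                   for v in nodes if co[(u, v)]}
        total = sum(weights.values())
        trans[u] = {v: w / total for v, w in weights.items()}

    # Random walk with uniform restart, solved by power iteration.
    pi = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / len(nodes) for v in nodes}
        for u in nodes:
            if not trans[u]:
                # No outgoing edges (e.g. a single feature): spread uniformly.
                for v in nodes:
                    nxt[v] += damping * pi[u] / len(nodes)
            for v, p in trans[u].items():
                nxt[v] += damping * pi[u] * p
        pi = nxt
    return pi


def object_scores(rows, pi):
    """Assumed object score: sum of the learned scores of the object's values."""
    return [sum(pi[(j, v)] for j, v in enumerate(row)) for row in rows]
```

For example, with `rows = [("a", "x")] * 8 + [("a", "y"), ("b", "y")]`, the rare values `b` and `y` receive higher scores than `a` and `x`, and the object `("b", "y")` is ranked as the most outlying. The learned value scores could equally be aggregated per feature to rank features, which is one way to read the abstract's "outlying feature selection", though the aggregation used in the paper is not stated in this record.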