CenKNN: A scalable and effective text classifier

A big challenge in text classification is to perform classification on a large-scale and high-dimensional text corpus in the presence of imbalanced class distributions and a large number of irrelevant or noisy term features. A number of techniques have been proposed to handle this challenge with var...

Bibliographic Details
Main Authors: PANG, Guansong, JIN, Huidong, JIANG, Shengyi
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2014
Subjects:
KNN
Online Access: https://ink.library.smu.edu.sg/sis_research/7027
https://ink.library.smu.edu.sg/context/sis_research/article/8030/viewcontent/Pang2015_Article_CenKNNAScalableAndEffectiveTex.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-8030
record_format dspace
spelling sg-smu-ink.sis_research-8030 2022-03-17T14:59:20Z CenKNN: A scalable and effective text classifier PANG, Guansong JIN, Huidong JIANG, Shengyi A big challenge in text classification is to perform classification on a large-scale and high-dimensional text corpus in the presence of imbalanced class distributions and a large number of irrelevant or noisy term features. A number of techniques have been proposed to handle this challenge with varying degrees of success. In this paper, by combining the strengths of two widely used text classification techniques, K-Nearest-Neighbor (KNN) and centroid based (Centroid) classifiers, we propose a scalable and effective flat classifier, called CenKNN, to cope with this challenge. CenKNN projects high-dimensional (often hundreds of thousands) documents into a low-dimensional (normally a few dozen) space spanned by class centroids, and then uses the \(k\)-d tree structure to find \(K\) nearest neighbors efficiently. Due to the strong representation power of class centroids, CenKNN overcomes two issues related to existing KNN text classifiers, i.e., sensitivity to imbalanced class distributions and irrelevant or noisy term features. By working on projected low-dimensional data, CenKNN substantially reduces the expensive computation time in KNN. CenKNN also works better than Centroid since it uses all the class centroids to define similarity and works well on complex data, i.e., non-linearly separable data and data with local patterns within each class. A series of experiments on both English and Chinese, benchmark and synthetic corpora demonstrates that although CenKNN works on a significantly lower-dimensional space, it performs substantially better than KNN and its five variants, and existing scalable classifiers, including Centroid and Rocchio. CenKNN is also empirically preferable to another well-known classifier, support vector machines, on highly imbalanced corpora with a small number of classes.
2014-07-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/7027 info:doi/10.1007/s10618-014-0358-x https://ink.library.smu.edu.sg/context/sis_research/article/8030/viewcontent/Pang2015_Article_CenKNNAScalableAndEffectiveTex.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Text classification KNN Centroid Dimension reduction Imbalanced classification Artificial Intelligence and Robotics Databases and Information Systems
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Text classification
KNN
Centroid
Dimension reduction
Imbalanced classification
Artificial Intelligence and Robotics
Databases and Information Systems
description A big challenge in text classification is to perform classification on a large-scale and high-dimensional text corpus in the presence of imbalanced class distributions and a large number of irrelevant or noisy term features. A number of techniques have been proposed to handle this challenge with varying degrees of success. In this paper, by combining the strengths of two widely used text classification techniques, K-Nearest-Neighbor (KNN) and centroid based (Centroid) classifiers, we propose a scalable and effective flat classifier, called CenKNN, to cope with this challenge. CenKNN projects high-dimensional (often hundreds of thousands) documents into a low-dimensional (normally a few dozen) space spanned by class centroids, and then uses the \(k\)-d tree structure to find \(K\) nearest neighbors efficiently. Due to the strong representation power of class centroids, CenKNN overcomes two issues related to existing KNN text classifiers, i.e., sensitivity to imbalanced class distributions and irrelevant or noisy term features. By working on projected low-dimensional data, CenKNN substantially reduces the expensive computation time in KNN. CenKNN also works better than Centroid since it uses all the class centroids to define similarity and works well on complex data, i.e., non-linearly separable data and data with local patterns within each class. A series of experiments on both English and Chinese, benchmark and synthetic corpora demonstrates that although CenKNN works on a significantly lower-dimensional space, it performs substantially better than KNN and its five variants, and existing scalable classifiers, including Centroid and Rocchio. CenKNN is also empirically preferable to another well-known classifier, support vector machines, on highly imbalanced corpora with a small number of classes.
format text
author PANG, Guansong
JIN, Huidong
JIANG, Shengyi
title CenKNN: A scalable and effective text classifier
publisher Institutional Knowledge at Singapore Management University
publishDate 2014
url https://ink.library.smu.edu.sg/sis_research/7027
https://ink.library.smu.edu.sg/context/sis_research/article/8030/viewcontent/Pang2015_Article_CenKNNAScalableAndEffectiveTex.pdf
_version_ 1770576190520688640
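
The description field above outlines a two-step pipeline: project each document onto the class centroids, then look up nearest neighbors with a k-d tree. The following is a minimal pure-Python sketch of that idea, not the authors' implementation. Using cosine similarity to each centroid as the projected coordinates, the majority vote, the toy data, and every function name are assumptions made for illustration.

```python
import heapq
import math
from collections import Counter, defaultdict


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def class_centroids(docs, labels):
    """Return one unit-length mean vector per class, in sorted class order."""
    groups = defaultdict(list)
    for doc, label in zip(docs, labels):
        groups[label].append(normalize(doc))
    return [normalize([sum(col) / len(groups[c]) for col in zip(*groups[c])])
            for c in sorted(groups)]


def project(doc, centroids):
    """Assumed projection: cosine similarity of the document to each centroid."""
    unit = normalize(doc)
    return [sum(a * b for a, b in zip(unit, c)) for c in centroids]


def build_kdtree(points, depth=0):
    """points: list of (vector, label). Returns nested tuples, or None if empty."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def knn_search(node, query, k, heap=None):
    """Collect the k nearest points in a max-heap of (-squared_distance, label)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    (vec, label), axis, left, right = node
    dist = sum((a - b) ** 2 for a, b in zip(vec, query))
    if len(heap) < k:
        heapq.heappush(heap, (-dist, label))
    elif dist < -heap[0][0]:
        heapq.heapreplace(heap, (-dist, label))
    diff = query[axis] - vec[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    knn_search(near, query, k, heap)
    # Visit the far side only if the splitting plane is closer than the
    # current k-th neighbor, or if fewer than k neighbors have been found.
    if len(heap) < k or diff * diff < -heap[0][0]:
        knn_search(far, query, k, heap)
    return heap


def train(docs, labels):
    """Compute centroids, project the training set, and index it in a k-d tree."""
    centroids = class_centroids(docs, labels)
    tree = build_kdtree([(project(d, centroids), y) for d, y in zip(docs, labels)])
    return centroids, tree


def predict(model, doc, k=3):
    """Majority vote among the k nearest projected training documents."""
    centroids, tree = model
    neighbors = knn_search(tree, project(doc, centroids), k)
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


# Toy term-frequency vectors over a hypothetical four-term vocabulary.
docs = [[3, 1, 0, 0], [2, 2, 0, 1], [4, 0, 1, 0],
        [0, 0, 3, 2], [0, 1, 2, 3], [1, 0, 4, 1]]
labels = ["sports", "sports", "sports", "finance", "finance", "finance"]
model = train(docs, labels)
print(predict(model, [5, 1, 0, 0]))
print(predict(model, [0, 0, 2, 2]))
```

In this sketch the projected space has one coordinate per class, so the k-d tree indexes two-dimensional points even though the input vectors have four terms; with a real corpus the same step would reduce a vocabulary-sized vector to as many coordinates as there are classes.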