FISA: Feature-based instance selection for imbalanced text classification

Support Vector Machines (SVM) classifiers are widely used in text classification tasks and these tasks often involve imbalanced training. In this paper, we specifically address the cases where negative training documents significantly outnumber the positive ones. A generic algorithm known as FISA (F...

Full description

Saved in:
Bibliographic Details
Main Authors: SUN, Aixin, LIM, Ee Peng, Benatallah, Boualem, Hassan, Mahbub
Format: text
Language:English
Published: Institutional Knowledge at Singapore Management University 2006
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/894
https://ink.library.smu.edu.sg/context/sis_research/article/1893/viewcontent/Sun2006_Chapter_FISAFeature_BasedInstanceSelec.pdf
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Singapore Management University
Language: English
Description
Summary:Support Vector Machines (SVM) classifiers are widely used in text classification tasks and these tasks often involve imbalanced training. In this paper, we specifically address the cases where negative training documents significantly outnumber the positive ones. A generic algorithm known as FISA (Feature-based Instance Selection Algorithm), is proposed to select only a subset of negative training documents for training a SVM classifier. With a smaller carefully selected training set, a SVM classifier can be more efficiently trained while delivering comparable or better classification accuracy. In our experiments on the 20-Newsgroups dataset, using only 35% negative training examples and 60% learning time, methods based on FISA delivered much better classification accuracy than those methods using all negative training documents.