FISA: Feature-based instance selection for imbalanced text classification

Support Vector Machine (SVM) classifiers are widely used in text classification tasks, which often involve imbalanced training data. In this paper, we specifically address cases where negative training documents significantly outnumber positive ones. A generic algorithm known as FISA (F...
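
As a rough illustration of the idea in this record's abstract (training on only a subset of the negative documents), here is a minimal Python sketch. The abstract does not state FISA's actual selection criterion, so the rule used here is a hypothetical stand-in: a negative document is kept only if it contains at least one of the `top_k` tokens that occur in the most positive documents. The function name `select_negatives`, the `top_k` parameter, and the toy documents are all assumptions made for this example.

```python
from collections import Counter


def select_negatives(positives, negatives, top_k=2):
    """Keep negatives that share a frequent positive-class token.

    Hypothetical selection rule for illustration only; it is NOT the
    criterion defined in the FISA paper.
    """
    # Count, for each token, how many positive documents contain it.
    doc_freq = Counter()
    for doc in positives:
        doc_freq.update(set(doc.lower().split()))

    # Rank by descending document frequency, breaking ties alphabetically
    # so the result is deterministic.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    key_features = {token for token, _ in ranked[:top_k]}

    # A negative is retained only if it overlaps with the key features.
    return [
        doc for doc in negatives
        if key_features & set(doc.lower().split())
    ]


# Toy data (invented for this example).
positives = ["graphics card driver", "graphics rendering driver bug"]
negatives = [
    "hockey game tonight",
    "printer driver install",
    "space shuttle launch",
    "graphics tablet review",
]

selected = select_negatives(positives, negatives, top_k=2)
print(selected)
```

Under this stand-in rule, the reduced negative set would be combined with all positive documents and passed to whichever SVM trainer is in use; the intent is only to show where instance selection sits in the pipeline, not to reproduce the published method.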

Bibliographic Details
Main Authors: SUN, Aixin, LIM, Ee Peng, Benatallah, Boualem, Hassan, Mahbub
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2006
Online Access: https://ink.library.smu.edu.sg/sis_research/894
https://ink.library.smu.edu.sg/context/sis_research/article/1893/viewcontent/Sun2006_Chapter_FISAFeature_BasedInstanceSelec.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-1893
record_format dspace
spelling sg-smu-ink.sis_research-1893
2018-06-25T08:54:19Z
2006-04-01T08:00:00Z
application/pdf
info:doi/10.1007/11731139_30
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Vector support machine
Statistical analysis
Electronic discussion group
Classification
Natural language
Text
Information retrieval
Content analysis
Data analysis
Knowledge discovery
Data mining
Databases and Information Systems
Numerical Analysis and Scientific Computing
description Support Vector Machine (SVM) classifiers are widely used in text classification tasks, which often involve imbalanced training data. In this paper, we specifically address cases where negative training documents significantly outnumber positive ones. A generic algorithm known as FISA (Feature-based Instance Selection Algorithm) is proposed to select only a subset of the negative training documents for training an SVM classifier. With a smaller, carefully selected training set, an SVM classifier can be trained more efficiently while delivering comparable or better classification accuracy. In our experiments on the 20-Newsgroups dataset, using only 35% of the negative training examples and 60% of the learning time, methods based on FISA delivered much better classification accuracy than methods using all negative training documents.
format text
author SUN, Aixin
LIM, Ee Peng
Benatallah, Boualem
Hassan, Mahbub
title FISA: Feature-based instance selection for imbalanced text classification
publisher Institutional Knowledge at Singapore Management University
publishDate 2006
_version_ 1770570761003597824