Under-sampling by algorithm with performance guaranteed for class-imbalance problem

© 2014 IEEE. The class-imbalance problem is the problem in which the amount of data in the majority class is much larger than that in the minority class. Traditional classifiers cannot handle this problem well because they focus more on the data in the majority class than on the data in the minority class, and consequently they...

Full description

Saved in:
Bibliographic Details
Main Authors: Wattana Jindaluang, Varin Chouvatut, Sanpawat Kantabutra
Format: Conference Proceeding
Published: 2018
Online Access:https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84942909601&origin=inward
http://cmuir.cmu.ac.th/jspui/handle/6653943832/45379
Institution: Chiang Mai University
id th-cmuir.6653943832-45379
record_format dspace
spelling th-cmuir.6653943832-45379 2018-01-24T06:09:25Z Under-sampling by algorithm with performance guaranteed for class-imbalance problem. Wattana Jindaluang; Varin Chouvatut; Sanpawat Kantabutra. Conference Proceeding, 2014-01-01. Scopus ID: 2-s2.0-84942909601. DOI: 10.1109/ICSEC.2014.6978197. https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84942909601&origin=inward http://cmuir.cmu.ac.th/jspui/handle/6653943832/45379
institution Chiang Mai University
building Chiang Mai University Library
country Thailand
collection CMU Intellectual Repository
description © 2014 IEEE. The class-imbalance problem is the problem in which the amount of data in the majority class is much larger than that in the minority class. Traditional classifiers cannot handle this problem well because they focus more on the data in the majority class than on the data in the minority class, and consequently they tend to predict incoming data as belonging to the majority class. Under-sampling is an efficient way to handle this problem because it selects representatives of the data in the majority class; for this reason, under-sampling also requires a shorter training period than over-sampling. The main drawback of under-sampling is that selecting representatives is likely to throw away important information in the majority class. To overcome this drawback, we propose a cluster-based under-sampling method. We use a clustering algorithm with a performance guarantee, the k-centers algorithm, to cluster the data in the majority class, select representative data in several proportions, and then combine them with all the data in the minority class to form the training set. In this paper, we compare our approach with k-means on five datasets from the UCI repository using two classifiers: 5-nearest neighbors and the C4.5 decision tree. Performance is measured by Precision, Recall, F-measure, and Accuracy. The experimental results show that our approach scores higher than the k-means approach on all measures except Precision, where both approaches achieve the same rate.
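Note: the description above outlines the method only at a high level. Below is a minimal, hypothetical Python sketch of that general idea — greedy farthest-point selection of k centers on the majority class (the classic 2-approximation for the k-center objective, the kind of performance guarantee the abstract refers to), with the selected representatives then combined with all minority-class data. It is not the authors' implementation; the function names and the proportion parameter are illustrative assumptions.

import numpy as np

def k_centers(X, k, rng=None):
    """Greedy farthest-point (Gonzalez) k-center selection.

    Returns the indices of k chosen centers. This greedy strategy is a
    2-approximation for the k-center objective; the exact algorithm used
    in the paper may differ in detail.
    """
    rng = np.random.default_rng(rng)
    centers = [int(rng.integers(len(X)))]           # start from a random point
    dist = np.linalg.norm(X - X[centers[0]], axis=1)
    for _ in range(1, k):
        nxt = int(np.argmax(dist))                  # farthest point from current centers
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(centers)

def undersample_majority(X, y, majority_label, proportion=0.5, rng=None):
    """Cluster-based under-sampling sketch.

    Keeps `proportion` of the majority class (its k-center representatives)
    and all of the minority class. `proportion` is a hypothetical knob
    standing in for the "several proportions" mentioned in the abstract.
    """
    maj = np.where(y == majority_label)[0]
    mino = np.where(y != majority_label)[0]
    k = max(1, int(proportion * len(maj)))
    reps = maj[k_centers(X[maj], k, rng)]           # map back to original indices
    keep = np.concatenate([reps, mino])
    return X[keep], y[keep]

# Toy usage: 90 majority points vs 10 minority points.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(3, 1, (10, 2))])
    y = np.array([0] * 90 + [1] * 10)
    Xb, yb = undersample_majority(X, y, majority_label=0, proportion=0.2, rng=0)
    print(Xb.shape, np.bincount(yb))                # e.g. (28, 2) [18 10]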
format Conference Proceeding
author Wattana Jindaluang
Varin Chouvatut
Sanpawat Kantabutra
author_sort Wattana Jindaluang
title Under-sampling by algorithm with performance guaranteed for class-imbalance problem
publishDate 2018
url https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84942909601&origin=inward
http://cmuir.cmu.ac.th/jspui/handle/6653943832/45379
_version_ 1681422734339342336