IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE
Main Author:
Format: Dissertations
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/73377
Institution: Institut Teknologi Bandung
Summary: Imbalanced data refers to a condition in which the samples in a problem are unevenly distributed across classes, leaving one or more classes under-represented in the data set. These under-represented classes are referred to as the minority class, while the remaining classes are called the majority class. The uneven distribution causes machine learning models to predict the minority class poorly, which leads to varying error costs. In many practical applications the data are highly imbalanced and the target of interest belongs to the minority class, so correctly classifying minority-class examples is often more important than correctly classifying majority-class examples.
Currently, the Synthetic Minority Oversampling Technique (SMOTE) has become the standard approach in the framework of learning from imbalanced data. SMOTE is an oversampling technique that generates synthetic data based on the k nearest neighbors (kNN) of each minority sample. Because SMOTE synthesizes new minority samples that differ from the original data, it reduces the risk of overfitting on the minority class. However, SMOTE also has limitations. It can generate noise: synthetic minority samples may end up inside the majority-class region. In addition, the kNN step in SMOTE relies on Euclidean distance, which becomes less effective as the dimensionality of the data increases. These limitations complicate the learning task and degrade the predictive accuracy of the learning algorithm.
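As an illustration of the oversampling step described above, the following is a minimal sketch of SMOTE-style synthetic sample generation, assuming a NumPy array `X_min` of minority-class samples and using scikit-learn's `NearestNeighbors` (Euclidean distance by default); it is a simplified sketch, not the thesis implementation, and all parameter names and values are illustrative.

```python
# Minimal SMOTE-style sketch: interpolate between each minority sample
# and one of its k nearest minority neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_synthetic=100, k=5, random_state=0):
    """Generate synthetic minority samples by interpolating towards kNN."""
    rng = np.random.default_rng(random_state)
    # kNN on the minority class only (Euclidean distance by default);
    # k + 1 because each sample is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh_idx = nn.kneighbors(X_min)

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))          # pick a minority sample
        j = rng.choice(neigh_idx[i][1:])      # pick one of its k neighbors
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```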
Therefore, this study aims to improve SMOTE so as to raise the predictive accuracy of machine learning on imbalanced data. The proposed approach addresses the problem of binary imbalanced classification. The improvement identifies noise among the synthetic minority samples generated by SMOTE using the Local Outlier Factor (LOF). The resulting method, SMOTE-LOF, was evaluated on imbalanced datasets and its predictive accuracy was compared with that of SMOTE. The results showed that SMOTE-LOF produced better accuracy and F-measure than SMOTE.
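The SMOTE-LOF idea described above can be sketched as follows, using imbalanced-learn's `SMOTE` and scikit-learn's `LocalOutlierFactor` as stand-ins for the thesis implementation; the assumption that synthetic samples are appended after the original data follows imbalanced-learn's behaviour, and the parameter values are illustrative.

```python
# Sketch of SMOTE-LOF: oversample with SMOTE, then flag noisy synthetic
# samples with Local Outlier Factor and drop them.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import LocalOutlierFactor

def smote_lof(X, y, k=5, random_state=0):
    X_res, y_res = SMOTE(k_neighbors=k, random_state=random_state).fit_resample(X, y)

    # Assumption of this sketch: imbalanced-learn appends synthetic samples
    # after the original data, so everything beyond len(X) is synthetic.
    n_orig = len(X)
    labels = LocalOutlierFactor(n_neighbors=k).fit_predict(X_res)  # -1 = outlier

    keep = np.ones(len(X_res), dtype=bool)
    keep[n_orig:] = labels[n_orig:] == 1   # drop only noisy *synthetic* samples
    return X_res[keep], y_res[keep]
```

In this sketch only synthetic samples flagged as local outliers are discarded, so the original training data are always retained.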
In addition, this study analyzes the effect of changing the distance metric used to determine the kNN in SMOTE from Euclidean distance to Manhattan distance and cosine distance, and examines the interaction effect of the distance metric with the imbalance ratio (IR) and the number of attributes on SMOTE's predictive accuracy in handling imbalanced data. Experiments were carried out on imbalanced datasets, comparing the predictive accuracy obtained on each dataset for each distance metric. The results showed that the interaction of the three distance metrics with the imbalance ratio and the number of attributes had no significant effect on predictive accuracy. However, SMOTE performed better with Manhattan distance than with Euclidean distance or cosine distance.
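To illustrate the metric change examined here, the short sketch below runs the kNN step with Euclidean, Manhattan, and cosine distance via scikit-learn's `metric` parameter on toy data; it reproduces only the neighbor search, not the full SMOTE experiment, and the data and parameter values are assumptions for illustration.

```python
# Compare the neighbors found by the kNN step under different distance metrics.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X_min = np.random.default_rng(0).random((50, 8))   # toy minority-class samples

for metric in ("euclidean", "manhattan", "cosine"):
    nn = NearestNeighbors(n_neighbors=6, metric=metric).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    print(metric, "-> neighbor indices of sample 0:", idx[0][1:])
```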
The proposed approach achieves the research objective of improving SMOTE. By identifying the noise produced by SMOTE and then removing it, SMOTE-LOF achieves 2–4% better accuracy and 1–6% better F-measure than SMOTE. Changing the distance metric in SMOTE from Euclidean distance to Manhattan distance improved the average F1 score and AUC by 6.93% and 3%, respectively.