IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE
Saved in:
Main Author: | Asniar |
---|---|
Format: | Dissertations |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/73377 |
Institution: | Institut Teknologi Bandung |
Language: | Indonesia |
id |
id-itb.:73377 |
---|---|
spelling |
id-itb.:73377 2023-06-20T08:43:45Z IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE Asniar Indonesia Dissertations imbalanced data, SMOTE, noise, outliers, distance metric, predictive accuracy, machine learning INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/73377 text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Imbalanced data refers to a condition in which the samples in a data set are not
evenly distributed across classes, leaving one or more classes under-represented.
These under-represented classes are referred to as the minority class, while the
remaining classes are called the majority class. This uneven distribution makes
machine learning models inaccurate when predicting the minority class and
produces unequal misclassification costs. In many practical applications the data
are highly imbalanced and the target of interest belongs to the minority class, so
correctly classifying minority-class examples is often more important than
correctly classifying majority-class examples.
Currently, the Synthetic Minority Oversampling Technique (SMOTE) has become
the standard in the framework of learning from imbalanced data. SMOTE is an
oversampling technique that generates synthetic data based on the k Nearest
Neighbors (kNN) of each minority sample. Because the synthesized minority data
differ from the original samples, SMOTE reduces the risk of overfitting to the
minority class. However, SMOTE also has limitations. It can generate noise, so
that some of the synthetic minority-class examples fall inside the majority-class
region. In addition, the kNN step in SMOTE still uses the Euclidean distance,
which becomes less effective as the data dimensionality increases. These
limitations complicate the learning task and degrade the predictive accuracy of
the learning algorithm.
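The oversampling step described above can be sketched as follows. This is a minimal illustration of the standard SMOTE interpolation rule, not the implementation evaluated in this dissertation; it assumes NumPy is available, and the function and variable names are chosen here for illustration:

```python
import numpy as np

def smote(minority, n_synthetic, k=5, seed=None):
    """Generate synthetic minority samples by interpolating between each
    minority point and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # a point is not its own neighbour
    k = min(k, len(X) - 1)
    nn = np.argsort(d, axis=1)[:, :k]     # indices of the k nearest neighbours
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))          # pick a random minority sample
        j = nn[i, rng.integers(k)]        # and one of its k neighbours
        gap = rng.random()                # interpolation factor in [0, 1)
        out.append(X[i] + gap * (X[j] - X[i]))
    return np.array(out)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synth = smote(minority, n_synthetic=6, k=2, seed=0)
print(synth.shape)  # (6, 2)
```

Because every synthetic point lies on a segment between two real minority points, the generated samples stay inside the convex hull of the minority class — which is also why, near a class boundary, they can land in majority territory and become the noise this study targets.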
Therefore, this study aims to improve SMOTE so that machine learning achieves
better predictive accuracy when handling imbalanced data. The proposed approach
targets binary imbalanced classification. The improvement identifies noise among
the synthetic minority data generated by SMOTE using the Local Outlier Factor
(LOF). The resulting method, SMOTE-LOF, was evaluated on imbalanced datasets
and its predictive accuracy was compared with that of SMOTE. The results show
that SMOTE-LOF produces better accuracy and F-measure than SMOTE.
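The noise-filtering idea can be illustrated with scikit-learn's `LocalOutlierFactor`. This is a simplified sketch of the general approach — score the synthetic samples with LOF and discard those flagged as outliers — and the data, parameters, and filtering rule here are illustrative assumptions, not the dissertation's exact procedure:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# a dense minority cluster, plus synthetic samples of which the last two
# have drifted far from the cluster and should be flagged as noise
real = rng.normal(loc=0.0, scale=0.2, size=(30, 2))
clean_synth = rng.normal(loc=0.0, scale=0.2, size=(10, 2))
noisy_synth = np.array([[3.0, 3.0], [-3.0, 3.0]])
synth = np.vstack([clean_synth, noisy_synth])

# fit LOF on real + synthetic minority data; fit_predict scores the
# training points themselves, marking outliers with -1
lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(np.vstack([real, synth]))
synth_labels = labels[len(real):]
kept = synth[synth_labels == 1]   # keep only synthetic samples not judged noise
print(len(kept), "of", len(synth), "synthetic samples retained")
```

LOF compares each point's local density with that of its neighbours, so synthetic points sitting in sparse regions away from the minority cluster receive high outlier scores and are removed before training.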
In addition, this study analyzes changing the distance metric used to determine
the kNN in SMOTE from the Euclidean distance to the Manhattan distance and the
cosine distance, and then analyzes the interaction effect of the distance metric
with the imbalance ratio (IR) and the number of attributes on SMOTE's prediction
accuracy in handling imbalanced data. Experiments were carried out on imbalanced
datasets, comparing the prediction accuracy obtained on each dataset for each
distance metric. The results show that the interaction of the three distance
metrics with the imbalance ratio and the number of attributes has no significant
effect on prediction accuracy. However, SMOTE performed better with the
Manhattan distance than with the Euclidean or cosine distance.
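The three metrics compared in the study differ only in how the pairwise distance matrix for the kNN step is computed. A minimal sketch, with function and variable names chosen here for illustration:

```python
import numpy as np

def pairwise(X, metric):
    """Pairwise distance matrix under the three metrics compared."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    if metric == "euclidean":
        # straight-line distance: sqrt of summed squared differences
        return np.sqrt((diff ** 2).sum(-1))
    if metric == "manhattan":
        # sum of absolute coordinate differences
        return np.abs(diff).sum(-1)
    if metric == "cosine":
        # 1 - cosine similarity: depends on direction, not magnitude
        norms = np.linalg.norm(X, axis=1)
        sim = (X @ X.T) / np.outer(norms, norms)
        return 1.0 - sim
    raise ValueError(f"unknown metric: {metric}")

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
for m in ("euclidean", "manhattan", "cosine"):
    print(m, pairwise(X, m)[0, 1])
```

Swapping the metric changes which points count as nearest neighbours, and hence where SMOTE places its synthetic samples; Manhattan distance in particular is often reported to degrade more gracefully than Euclidean distance as dimensionality grows, which is consistent with the study's finding.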
The proposed approach achieves the research objective of improving SMOTE. By
identifying SMOTE's noise and removing it, SMOTE-LOF attains 2-4% better
accuracy and 1-6% better F-measure than SMOTE. Changing the distance metric in
SMOTE from the Euclidean distance to the Manhattan distance yields better
average F1 score and AUC, by 6.93% and 3%, respectively. |
format |
Dissertations |
author |
Asniar |
spellingShingle |
Asniar IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE |
author_facet |
Asniar |
author_sort |
Asniar |
title |
IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE |
title_short |
IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE |
title_full |
IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE |
title_fullStr |
IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE |
title_full_unstemmed |
IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE |
title_sort |
imbalanced data handling to improve predictive accuracy performance |
url |
https://digilib.itb.ac.id/gdl/view/73377 |
_version_ |
1822992983820075008 |