IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE

Imbalanced data usually refers to a condition in which the data samples in a particular problem are not evenly distributed, causing under-representation of one or more classes in the data set. These under-represented classes are referred to as the minority class, while the other classes are called the majority class.


Bibliographic Details
Main Author: Asniar
Format: Dissertations
Language: Indonesia
Online Access:https://digilib.itb.ac.id/gdl/view/73377
Institution: Institut Teknologi Bandung
Language: Indonesia
id id-itb.:73377
spelling id-itb.:73377 2023-06-20T08:43:45Z IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE Asniar Indonesia Dissertations imbalanced data, SMOTE, noise, outliers, distance metric, predictive accuracy, machine learning INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/73377 Imbalanced data usually refers to a condition in which the data samples in a particular problem are not evenly distributed, causing under-representation of one or more classes in the data set. These under-represented classes are referred to as the minority class, while the other classes are called the majority class. The uneven distribution of data causes machine learning predictions to be inaccurate for the minority class, resulting in varying error costs. In many practical applications that work with highly imbalanced data, the target of interest belongs to the minority class, so correctly classifying minority class examples is often more important than correctly classifying majority class examples. Currently, the Synthetic Minority Oversampling Technique (SMOTE) has become the standard in the framework of learning from imbalanced data. SMOTE is an oversampling technique that generates synthetic data based on the k Nearest Neighbors (kNN) of each minority class sample. SMOTE synthesizes new minority data that differ from the original data, thereby reducing the impact of overfitting on the minority class. However, SMOTE also has limitations. SMOTE can generate noise, so that synthetic minority class examples may fall inside the majority class region. In addition, the kNN search in SMOTE still uses Euclidean distance, which becomes less effective as the number of data dimensions increases. These limitations of SMOTE complicate the learning task and degrade the predictive accuracy of the learning algorithm. Therefore, this study aims to improve SMOTE in order to improve machine learning predictive accuracy in handling imbalanced data. The proposed approach addresses the problem of binary imbalanced classification. Improvements were made by identifying noise in the synthetic minority data generated by SMOTE using the Local Outlier Factor (LOF). The proposed method, SMOTE-LOF, was evaluated on imbalanced datasets and its predictive accuracy was compared with that of SMOTE. The results showed that SMOTE-LOF produced better accuracy and F-measure than SMOTE. In addition, this study analyzes changing the distance metric used to determine kNN in SMOTE from Euclidean distance to Manhattan distance and cosine distance, and then analyzes the interaction effect of the distance metric with the imbalance ratio (IR) and the number of attributes on the prediction accuracy of SMOTE in handling imbalanced data. Experiments were carried out on imbalanced datasets, comparing the prediction accuracy obtained on each dataset for each distance metric. The results showed that the interaction of the three distance metrics with the imbalance ratio and the number of attributes had no significant effect on prediction accuracy. However, SMOTE performed better with Manhattan distance than with Euclidean distance or cosine distance. The proposed approach thus achieves the research objective of improving SMOTE.
By identifying SMOTE noise and then removing it, SMOTE-LOF achieves 2.4% better accuracy and 1.6% better F-measure than SMOTE. Changing the distance metric in SMOTE from Euclidean distance to Manhattan distance improved average F1 score and AUC by 6.93% and 3%, respectively. text
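A minimal sketch of the SMOTE-LOF idea described above: generate synthetic minority samples with SMOTE, score the resampled data with a Local Outlier Factor model, and drop the synthetic samples flagged as outliers. The sketch uses imbalanced-learn and scikit-learn; the function name smote_lof, the neighbour settings, and fitting LOF on the full resampled set are illustrative assumptions rather than the dissertation's exact procedure.

# Minimal SMOTE-LOF sketch: oversample with SMOTE, then drop synthetic
# minority samples that Local Outlier Factor flags as noise.
# Illustration only; parameters and structure are assumed, not taken
# from the dissertation.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import LocalOutlierFactor

def smote_lof(X, y, k_neighbors=5, lof_neighbors=20):
    # 1) Generate synthetic minority samples with plain SMOTE.
    X_res, y_res = SMOTE(k_neighbors=k_neighbors, random_state=0).fit_resample(X, y)

    # imbalanced-learn appends synthetic rows after the original samples,
    # so every row beyond len(X) is synthetic.
    synthetic = np.arange(len(X_res)) >= len(X)

    # 2) Fit LOF on the resampled data; fit_predict returns -1 for outliers.
    lof_labels = LocalOutlierFactor(n_neighbors=lof_neighbors).fit_predict(X_res)

    # 3) Keep all original samples plus the synthetic ones LOF did not flag.
    keep = ~synthetic | (lof_labels == 1)
    return X_res[keep], y_res[keep]

In use, X and y would be the original imbalanced feature matrix and labels, and the returned arrays would be passed to any downstream classifier.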
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description Imbalanced data usually refers to a condition in which the data samples in a particular problem are not evenly distributed, causing under-representation of one or more classes in the data set. These under-represented classes are referred to as the minority class, while the other classes are called the majority class. The uneven distribution of data causes machine learning predictions to be inaccurate for the minority class, resulting in varying error costs. In many practical applications that work with highly imbalanced data, the target of interest belongs to the minority class, so correctly classifying minority class examples is often more important than correctly classifying majority class examples. Currently, the Synthetic Minority Oversampling Technique (SMOTE) has become the standard in the framework of learning from imbalanced data. SMOTE is an oversampling technique that generates synthetic data based on the k Nearest Neighbors (kNN) of each minority class sample. SMOTE synthesizes new minority data that differ from the original data, thereby reducing the impact of overfitting on the minority class. However, SMOTE also has limitations. SMOTE can generate noise, so that synthetic minority class examples may fall inside the majority class region. In addition, the kNN search in SMOTE still uses Euclidean distance, which becomes less effective as the number of data dimensions increases. These limitations of SMOTE complicate the learning task and degrade the predictive accuracy of the learning algorithm. Therefore, this study aims to improve SMOTE in order to improve machine learning predictive accuracy in handling imbalanced data. The proposed approach addresses the problem of binary imbalanced classification. Improvements were made by identifying noise in the synthetic minority data generated by SMOTE using the Local Outlier Factor (LOF). The proposed method, SMOTE-LOF, was evaluated on imbalanced datasets and its predictive accuracy was compared with that of SMOTE. The results showed that SMOTE-LOF produced better accuracy and F-measure than SMOTE. In addition, this study analyzes changing the distance metric used to determine kNN in SMOTE from Euclidean distance to Manhattan distance and cosine distance, and then analyzes the interaction effect of the distance metric with the imbalance ratio (IR) and the number of attributes on the prediction accuracy of SMOTE in handling imbalanced data. Experiments were carried out on imbalanced datasets, comparing the prediction accuracy obtained on each dataset for each distance metric. The results showed that the interaction of the three distance metrics with the imbalance ratio and the number of attributes had no significant effect on prediction accuracy. However, SMOTE performed better with Manhattan distance than with Euclidean distance or cosine distance. The proposed approach thus achieves the research objective of improving SMOTE. By identifying SMOTE noise and then removing it, SMOTE-LOF achieves 2.4% better accuracy and 1.6% better F-measure than SMOTE. Changing the distance metric in SMOTE from Euclidean distance to Manhattan distance improved average F1 score and AUC by 6.93% and 3%, respectively.
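A similarly hedged sketch of the distance-metric variation: imbalanced-learn's SMOTE accepts a pre-configured NearestNeighbors estimator as k_neighbors, so the kNN search can use Manhattan or cosine distance instead of the default Euclidean. The toy dataset, the K + 1 neighbour count (assumed to mirror how the library expands an integer k internally), and the evaluation step are illustrative assumptions, not the study's protocol.

# Sketch: vary SMOTE's kNN distance metric (Euclidean, Manhattan, cosine)
# by passing a configured NearestNeighbors estimator. Illustration only.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

# Toy imbalanced binary dataset as a stand-in for the study's datasets.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

K = 5  # number of nearest neighbours SMOTE interpolates between
for metric in ("euclidean", "manhattan", "cosine"):
    # K + 1 so the query point itself does not consume a neighbour slot
    # (assumed to match imbalanced-learn's handling of an integer k).
    knn = NearestNeighbors(n_neighbors=K + 1, metric=metric)
    X_res, y_res = SMOTE(k_neighbors=knn, random_state=0).fit_resample(X, y)
    print(metric, X_res.shape, int((y_res == 1).sum()), "minority samples")
    # ...train a classifier on (X_res, y_res) and compare F1 score and AUC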
format Dissertations
author Asniar
spellingShingle Asniar
IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE
author_facet Asniar
author_sort Asniar
title IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE
title_short IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE
title_full IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE
title_fullStr IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE
title_full_unstemmed IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE
title_sort imbalanced data handling to improve predictive accuracy performance
url https://digilib.itb.ac.id/gdl/view/73377
_version_ 1822992983820075008