IMBALANCED DATA HANDLING TO IMPROVE PREDICTIVE ACCURACY PERFORMANCE
Main Author:
Format: Dissertations
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/73377
Institution: Institut Teknologi Bandung
Summary: Imbalanced data refers to a condition in which the samples in a problem are unevenly distributed across classes, leaving one or more classes under-represented in the data set. These under-represented classes are referred to as the minority class, while the remaining classes are called the majority class. The uneven distribution causes machine learning models to predict the minority class poorly, which leads to varying error costs. In many practical applications the data are highly imbalanced and the target of interest belongs to the minority class, so correctly classifying minority-class examples is often more important than correctly classifying majority-class examples.
Currently, the Synthetic Minority Oversampling Technique (SMOTE) has become the standard approach in the framework of learning from imbalanced data. SMOTE is an oversampling technique that generates synthetic data based on the k nearest neighbors (kNN) of each minority sample. Because SMOTE synthesizes new minority samples that differ from the original data, it reduces the risk of overfitting on the minority class. However, SMOTE also has limitations. It can generate noise: synthetic minority samples may end up inside the majority-class region. In addition, the kNN step in SMOTE relies on Euclidean distance, which becomes less effective as the dimensionality of the data increases. These limitations complicate the learning task and degrade the predictive accuracy of the learning algorithm.
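As an illustration of the oversampling step described above, the following is a minimal sketch of SMOTE-style synthetic sample generation, assuming a NumPy array `X_min` of minority-class samples and using scikit-learn's `NearestNeighbors` (Euclidean distance by default); it is a simplified sketch, not the thesis implementation, and all parameter names and values are illustrative.

```python
# Minimal SMOTE-style sketch: interpolate between each minority sample
# and one of its k nearest minority neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_synthetic=100, k=5, random_state=0):
    """Generate synthetic minority samples by interpolating towards kNN."""
    rng = np.random.default_rng(random_state)
    # kNN on the minority class only (Euclidean distance by default);
    # k + 1 because each sample is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh_idx = nn.kneighbors(X_min)

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))          # pick a minority sample
        j = rng.choice(neigh_idx[i][1:])      # pick one of its k neighbors
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```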
Therefore, this study aims to improve SMOTE so as to raise the predictive accuracy of machine learning on imbalanced data. The proposed approach addresses the problem of binary imbalanced classification. The improvement identifies noise among the synthetic minority samples generated by SMOTE using the Local Outlier Factor (LOF). The resulting method, SMOTE-LOF, was evaluated on imbalanced datasets and its predictive accuracy was compared with that of SMOTE. The results showed that SMOTE-LOF produced better accuracy and F-measure than SMOTE.
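The SMOTE-LOF idea described above can be sketched as follows, using imbalanced-learn's `SMOTE` and scikit-learn's `LocalOutlierFactor` as stand-ins for the thesis implementation; the assumption that synthetic samples are appended after the original data follows imbalanced-learn's behaviour, and the parameter values are illustrative.

```python
# Sketch of SMOTE-LOF: oversample with SMOTE, then flag noisy synthetic
# samples with Local Outlier Factor and drop them.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import LocalOutlierFactor

def smote_lof(X, y, k=5, random_state=0):
    X_res, y_res = SMOTE(k_neighbors=k, random_state=random_state).fit_resample(X, y)

    # Assumption of this sketch: imbalanced-learn appends synthetic samples
    # after the original data, so everything beyond len(X) is synthetic.
    n_orig = len(X)
    labels = LocalOutlierFactor(n_neighbors=k).fit_predict(X_res)  # -1 = outlier

    keep = np.ones(len(X_res), dtype=bool)
    keep[n_orig:] = labels[n_orig:] == 1   # drop only noisy *synthetic* samples
    return X_res[keep], y_res[keep]
```

In this sketch only synthetic samples flagged as local outliers are discarded, so the original training data are always retained.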
In addition, this study analyzes the effect of changing the distance metric used to determine the kNN in SMOTE from Euclidean distance to Manhattan distance and cosine distance, and examines the interaction effect of the distance metric with the imbalance ratio (IR) and the number of attributes on SMOTE's predictive accuracy in handling imbalanced data. Experiments were carried out on imbalanced datasets, comparing the predictive accuracy obtained on each dataset for each distance metric. The results showed that the interaction of the three distance metrics with the imbalance ratio and the number of attributes had no significant effect on predictive accuracy. However, SMOTE performed better with Manhattan distance than with Euclidean distance or cosine distance.
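To illustrate the metric change examined here, the short sketch below runs the kNN step with Euclidean, Manhattan, and cosine distance via scikit-learn's `metric` parameter on toy data; it reproduces only the neighbor search, not the full SMOTE experiment, and the data and parameter values are assumptions for illustration.

```python
# Compare the neighbors found by the kNN step under different distance metrics.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X_min = np.random.default_rng(0).random((50, 8))   # toy minority-class samples

for metric in ("euclidean", "manhattan", "cosine"):
    nn = NearestNeighbors(n_neighbors=6, metric=metric).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    print(metric, "-> neighbor indices of sample 0:", idx[0][1:])
```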
The proposed approach achieves the research objective of improving SMOTE. By identifying the noise produced by SMOTE and then removing it, SMOTE-LOF achieves 2–4% better accuracy and 1–6% better F-measure than SMOTE. Changing the distance metric in SMOTE from Euclidean distance to Manhattan distance improved the average F1 score and AUC by 6.93% and 3%, respectively.