THE FEATURE SELECTION ALGORITHM MODIFICATION IN THE RANDOM FOREST

Bibliographic Details
Main Author: Irmina Prasetiyowati, Maria
Format: Dissertations
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/70612
Institution: Institut Teknologi Bandung
Description
Summary: Random Forest is a tree-based machine learning algorithm with a random feature selection process. Information Gain (IG) is one method for determining the importance of the features in a dataset: it measures the amount of information each feature carries, so that only features with high information value are selected and the algorithm runs faster. To select informative features, IG uses a cut-off (threshold) value. In most cases the threshold is chosen freely or set to 0.05. This study proposes a modification of the IG algorithm in which the threshold is determined from the standard deviation, the median, and the median computed on the real values of the transformed features. The main objective of this research is to reduce the execution time of Random Forest while still taking the resulting accuracy into account. The test data are ten classification datasets from the UCI Machine Learning Repository and Kaggle. All results were compared with feature selection using a threshold of 0.05, with the Correlation-Based Feature Selection algorithm, and with Random Forest without feature selection. The first study determines the threshold from the standard deviation of the IG values produced by the features in the dataset. This first proposed threshold was tested on eight original datasets and on datasets transformed using the Fast Fourier Transform (FFT). On both the original and the transformed datasets, Random Forest with feature selection took less execution time than Random Forest without feature selection: more than 80% of all datasets ran faster than Random Forest without feature selection. In terms of accuracy, 62.5% of the original datasets reached the same accuracy as Random Forest without feature selection.
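The standard-deviation threshold described above can be illustrated with a minimal sketch. It is not the author's implementation: it assumes discrete feature values, uses the population standard deviation, and keeps features whose IG is strictly above the threshold. The names `information_gain` and `select_by_std_threshold` and the toy data are hypothetical.

```python
import math
import statistics
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG of one discrete feature: H(labels) - H(labels | feature)."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_by_std_threshold(columns, labels):
    """Keep features whose IG exceeds the standard deviation of all IG values.

    Assumption: population standard deviation and a strict '>' comparison;
    the summary does not state either choice.
    """
    gains = {name: information_gain(vals, labels)
             for name, vals in columns.items()}
    threshold = statistics.pstdev(gains.values())
    selected = [name for name, gain in gains.items() if gain > threshold]
    return selected, gains, threshold


# Toy data (hypothetical): one informative feature, two uninformative ones.
labels = [0, 0, 1, 1]
columns = {
    "informative": ["a", "a", "b", "b"],
    "noise": ["x", "y", "x", "y"],
    "constant": ["k", "k", "k", "k"],
}
selected, gains, threshold = select_by_std_threshold(columns, labels)
print(selected, round(threshold, 4))
```

Because the threshold is derived from the spread of the IG scores themselves, it adapts to each dataset instead of relying on a fixed value such as 0.05.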
The threshold was also determined from the median of the IG values produced by the features in the dataset. Before the median is calculated, the IG values are first transformed using the Fast Fourier Transform. The IG method with a median-based threshold produces a better average accuracy than Correlation-Based Feature Selection, the 0.05 threshold, and the threshold based on standard deviation. The average accuracy of this method can be increased further by computing the median threshold on the real values of the transformed features. For average execution time (speed), however, the better method is IG with the threshold based on standard deviation. The second proposed threshold is obtained by first transforming the IG values using the Fast Fourier Transform and then repeatedly computing the median. The test uses a total of 9 datasets, consisting of 3 balanced and 6 imbalanced datasets. To balance the data, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to the imbalanced datasets. The results show that feature selection using IG with a repeated-median threshold, the Fast Fourier Transform, and SMOTE improves the accuracy of Random Forest. The models were evaluated using K-fold cross-validation with K = 10 and using a split of the dataset into 75% training data and 25% test data. The trials showed that the IG method with a threshold based on the repeated median (MRT) increased average accuracy by between 0.18% and 3.43% compared with IG based on standard deviation. Compared with IG based on the median, the average accuracy of IG based on the repeated median increased by between 1.84% and 5.75%.
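One possible reading of the median threshold on transformed IG values is sketched below. The summary does not say which values are compared against the threshold or how the median is repeated, so several points are assumptions: the IG scores are transformed with a naive discrete Fourier transform (which gives the same result as an FFT), the real parts are kept, and feature i is retained when its transformed real value is strictly above the median. The function names and the IG scores are hypothetical, and only a single median pass is shown.

```python
import cmath
import statistics


def dft(values):
    """Naive discrete Fourier transform; same output as an FFT, O(n^2)."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, v in enumerate(values))
        for k in range(n)
    ]


def select_by_fft_median(gains):
    """Keep features whose transformed real value exceeds the median.

    Assumption: feature i is paired with the real part of the i-th
    transform coefficient; the summary does not confirm this pairing.
    """
    names = list(gains)
    real_parts = [c.real for c in dft([gains[name] for name in names])]
    threshold = statistics.median(real_parts)
    selected = [name for name, r in zip(names, real_parts) if r > threshold]
    return selected, threshold


# Hypothetical IG scores for four features.
ig_scores = {"f0": 0.4, "f1": 0.1, "f2": 0.3, "f3": 0.2}
kept, median_threshold = select_by_fft_median(ig_scores)
print(kept, round(median_threshold, 4))
```

A repeated-median variant could apply the same step again to the surviving features, but the stopping rule is not given in the summary, so it is left out here.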
In terms of average execution time (speed), the better method remains IG with the threshold based on standard deviation. Feature selection with IG based on the standard deviation threshold and IG based on the MRT threshold was also compared with the K-NN and SVM algorithms. The accuracy of the two proposed methods is superior, by a margin ranging from 0.0054 to 0.4788 points.
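The evaluation protocol mentioned in the summary (K-fold cross-validation with K = 10 and a 75%/25% train/test split) can be sketched with index generators. This is a plain-Python illustration under assumed details: shuffling with a fixed seed and no stratification, neither of which is stated in the summary. The function names are hypothetical.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test


# Usage with a hypothetical dataset of 100 samples.
train_idx, test_idx = train_test_split_indices(100)
all_folds = list(k_fold_indices(100, k=10))
print(len(train_idx), len(test_idx), len(all_folds))
```

Each sample appears in exactly one test fold, so averaging accuracy and execution time over the ten folds uses every sample once for testing.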