THE FEATURE SELECTION ALGORITHM MODIFICATION IN THE RANDOM FOREST
Main Author: | |
---|---|
Format: | Dissertations |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/70612 |
Institution: | Institut Teknologi Bandung |
Summary: | Random Forest is a tree-based machine learning algorithm with an informative
random feature selection process. One method for measuring the importance of features
in a dataset is Information Gain (IG). IG calculates the amount of information each
feature contains, and features with high information value are selected to speed up
the algorithm. To select informative features, IG uses a cut-off (threshold) value.
In most cases the threshold is chosen arbitrarily or set to
0.05. This study proposes a modification to the IG algorithm by determining the
threshold value based on the standard deviation, median, and real median values
for the transformed features. The main objective of this research is to improve the
execution speed of Random Forest while still paying attention to the resulting
accuracy. The datasets used for testing are ten classification datasets available in the UCI
Machine Learning Repository and Kaggle. All
tests were compared with the results of feature selection with a threshold of 0.05,
the Correlation-based Feature Selection algorithm, and Random Forest without
feature selection.
The first study determines the threshold value using the standard
deviation of the IG values generated by the features in the dataset. This first
proposed threshold was tested on eight original datasets and on datasets
transformed using the Fast Fourier Transform. On both the original and
the transformed datasets, Random Forest with
feature selection took less execution time than Random Forest without feature selection: over 80%
of all datasets took less time. As for
accuracy, 62.5% of the original datasets reached the same accuracy as
Random Forest without feature selection.
The threshold was also determined using the median of
the IG values generated by the features in the dataset. Before the median is
calculated, the IG values are first transformed using the Fast
Fourier Transform. The IG method with a median-based threshold
produces a better average accuracy than Correlation-based
Feature Selection, the 0.05 threshold, and the threshold based on standard deviation.
The average accuracy can be increased further by basing the median threshold on the real values of the transformed features.
For average execution time (speed), however, the better method is IG
with the threshold based on standard deviation.
The second proposed threshold is obtained by first
transforming the IG values using the Fast Fourier Transform and then
repeatedly taking the median. The test uses a total of 9 datasets,
consisting of 3 balanced datasets and 6 imbalanced datasets. To
balance the data, the Synthetic Minority Over-sampling Technique (SMOTE) is
applied to the imbalanced datasets. The results show that feature selection using IG with
repeated median thresholds, the Fast Fourier Transform, and SMOTE improves the
accuracy of Random Forest. The models were evaluated with K-Fold Cross
Validation with K = 10 and with a hold-out split of 75%
training data and 25% test data. The trials show that
the IG method with thresholds based on repeated median values (MRT) yields
an average accuracy between 0.18% and 3.43% higher than
IG based on standard deviation. Compared to IG based on the
median, the average accuracy of IG based on the repeated median is between
1.84% and 5.75% higher. For average execution time (speed), however, the better
method is IG with the threshold based on standard deviation.
Feature selection with IG based on the standard deviation
threshold and with IG based on the MRT threshold was also compared with the K-NN
and SVM algorithms. The accuracy of the two proposed methods is
higher by 0.0054 to 0.4788 points. |
---|
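
The idea of deriving the IG cut-off from the data, as the summary describes, can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration and not the dissertation's implementation: the helper names (`entropy`, `information_gain`, `select_features`), the toy data, the use of the population standard deviation, and the rule that a feature is kept when its IG is at least the threshold. The Fast Fourier Transform step and the repeated-median procedure are left out because the summary does not specify them in enough detail; a plain median stands in for comparison.

```python
import math
import statistics
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG of a discrete feature: H(labels) - H(labels | feature)."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_features(ig_by_feature, threshold):
    """Keep features whose IG is at least the threshold (assumed rule)."""
    return [name for name, ig in ig_by_feature.items() if ig >= threshold]


# Hypothetical toy dataset: three discrete features, binary class.
labels = ["yes", "yes", "no", "no"]
features = {
    "perfect": ["a", "a", "b", "b"],  # separates the classes exactly
    "useless": ["a", "b", "a", "b"],  # carries no class information
    "partial": ["a", "a", "a", "b"],  # carries some class information
}

ig = {name: information_gain(col, labels) for name, col in features.items()}

fixed_threshold = 0.05                               # conventional fixed cut-off
std_threshold = statistics.pstdev(ig.values())       # data-driven: std of IG values
median_threshold = statistics.median(ig.values())    # data-driven: plain median

for label, thr in [("fixed 0.05", fixed_threshold),
                   ("std", std_threshold),
                   ("median", median_threshold)]:
    print(f"{label:>10}: threshold={thr:.4f} -> {select_features(ig, thr)}")
```

On this toy data the standard-deviation threshold is stricter than the fixed 0.05 cut-off and keeps fewer features, which is the mechanism by which a data-driven threshold could shorten Random Forest training; whether accuracy is preserved depends on the dataset.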