A CLASS CENTER-BASED METHOD FOR MISSING DATA IMPUTATION IN CLASSIFICATION PROBLEMS
Main Author: | |
---|---|
Format: | Dissertations |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/70616 |
Institution: | Institut Teknologi Bandung |
Summary: | As data volumes grow, computational and statistical methods for
processing and analysing data into insight and knowledge are advancing rapidly.
The main challenge is that raw data cannot be used directly for analysis
because of data quality problems. One such problem is completeness: missing
data is a common cause of incomplete datasets. At present, the available
analytical methods generally work only with complete data. In quantitative
studies, missing data leads to biased parameter estimates. In predictive
modelling, an incorrect choice of method for handling missing data can affect
model performance: an unsuitable imputation method can bias the classifier
and lower the classification quality on the test data.
Over the past five decades, various methods have been developed for handling
missing data. The literature on missing data analysis is extensive and still
growing rapidly. In general, there are three strategies for dealing with
missing data: removal, imputation, and use as is. All three are intended to
resolve the missing values so that the data can be processed as required.
Many imputation methods are computationally expensive and unsuitable for
large-scale datasets, and there is no universally best imputation method.
Missing data is a major concern in machine learning and related fields,
including the medical domain. Methods based on machine learning techniques
are regarded as the most suitable for imputing missing values. However, most
machine learning techniques, with the exception of kNN, are more
computationally expensive than many statistical techniques. More complex
algorithms may produce better imputation results, but at a higher
computational cost.
For classification problems, class center-based imputation methods have been
developed; they outperform other methods on numerical and mixed data but not
on categorical data. Many techniques for handling missing data ignore
correlations between data attributes, and some are suitable only for
categorical data. Yet the performance of a missing-value imputation algorithm
is significantly affected by factors such as the correlation structure of the
data. An adaptive search procedure is one way to estimate missing data while
accounting for correlations and interrelationships between variables. The
Firefly Algorithm (FA) applies such a procedure, imputing missing data by
searching for the estimate that is closest to the values in other known data.
Data normalization and the handling of missing values are regarded as major
problems in the pre-processing stage when classification algorithms are
applied to numerical features. In addition, if the data contain outliers, the
estimated missing values may be unreliable or may even differ greatly from
the true values. For categorical data, target encoding uses information from
the target variable, but it risks overfitting and is inaccurate for
categories that occur rarely in the data. This dissertation proposes a class
center-based method for handling missing data that modifies the search
pattern of the Firefly Algorithm (FA) according to the attribute correlations
of the data during imputation. The proposed method also takes data
normalization and the presence of outliers into account when imputing
numerical data, and applies smoothed target encoding before imputation for
categorical data.
Tests on several datasets show that the proposed method can reproduce the
actual values in the data (predictive accuracy, PAC) and can preserve the
distribution of the missing values (distributional accuracy, DAC). The
proposed method also yields a smaller root mean squared error (RMSE) than the
SVM, KKNI, WRF, FKKNI, and CCMVI methods. A further contribution of this
dissertation is the study of the effect of handling outliers (O) and
normalization (N) before imputation: the proposed ON+C3-FA method outperforms
mean imputation, random imputation, linear regression, multiple imputation,
and kNN imputation. For categorical datasets, the proposed C3FA-STD method
yields better AUC, CA, F1-score, precision, and recall values, outperforming
both mode imputation, the best method for categorical data in previous
studies, and decision tree imputation. |
---|
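
The summary does not give the details of the proposed C3-FA method, so the following is only a minimal sketch of the simpler baseline idea it builds on: filling each missing numerical value with the center (here assumed to be the per-class mean) of the observed values for that attribute, then scoring the result with RMSE against known true values. It contains no Firefly Algorithm search, outlier handling, or normalization, and the function names `class_center_impute` and `rmse` are hypothetical, chosen for illustration.

```python
import numpy as np


def class_center_impute(X, y):
    """Fill NaNs in X with the mean of the same column within the same class.

    Assumption: the "class center" is the per-class mean of observed values.
    A column with no observed value in a class is left as NaN.
    """
    X_filled = np.asarray(X, dtype=float).copy()
    y = np.asarray(y)
    for label in np.unique(y):
        rows = np.where(y == label)[0]
        block = X_filled[rows]                 # copy of this class's rows
        observed = ~np.isnan(block)
        counts = observed.sum(axis=0)
        sums = np.where(observed, block, 0.0).sum(axis=0)
        # Per-column class center; NaN where the class has no observed value
        center = np.divide(
            sums, counts,
            out=np.full(block.shape[1], np.nan),
            where=counts > 0,
        )
        miss_r, miss_c = np.where(~observed)
        block[miss_r, miss_c] = center[miss_c]
        X_filled[rows] = block                 # write the filled rows back
    return X_filled


def rmse(true_values, imputed_values):
    """Root mean squared error between true and imputed values."""
    diff = np.asarray(true_values, float) - np.asarray(imputed_values, float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy data (made up for illustration): two classes, two numerical attributes
X_true = np.array([[1.0, 10.0], [3.0, 14.0], [10.0, 100.0], [12.0, 104.0]])
y = np.array([0, 0, 1, 1])

X_missing = X_true.copy()
X_missing[1, 1] = np.nan   # hide one value in class 0
X_missing[2, 0] = np.nan   # hide one value in class 1

mask = np.isnan(X_missing)
X_imputed = class_center_impute(X_missing, y)
print("RMSE on hidden entries:", rmse(X_true[mask], X_imputed[mask]))
```

Computing RMSE only over the artificially hidden entries mirrors the usual way imputation methods are compared: values are removed from complete data, imputed, and then checked against the values that were removed.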