A CLASS CENTER-BASED METHOD FOR MISSING DATA IMPUTATION IN CLASSIFICATION PROBLEMS


Bibliographic Details
Main Author: Nugroho, Heru
Format: Dissertations
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/70616
Institution: Institut Teknologi Bandung
Description
Along with the growing size of data, there has been a revolution in computational methods and statistics to process and analyse data into insight and knowledge. The main challenge is that raw data cannot be used directly for analysis, which is a matter of data quality. One data quality problem is completeness, and missing data is a frequent cause of incomplete data. At present, most available analytical methods can only work with complete data. In quantitative studies, missing data leads to biased parameter estimates; in predictive modelling, an inappropriate choice of method for handling missing data can degrade model performance. In particular, an incorrect choice of imputation method can bias the learned classifier and produce low classification quality on the test data.

Over the past five decades, various methods have been developed for handling missing data, and the literature on missing data analysis is broad and still developing rapidly. In general, there are three strategies for dealing with missing data: removal, imputation, and using the data as is. All three aim to make the data processable for the task at hand. Many imputation methods are computationally expensive and unsuitable for large-scale datasets, and no imputation method is universally best. The occurrence of missing data is a major concern in machine learning and related fields, including the medical domain. Methods based on machine learning techniques are among the most suitable for imputing missing values; however, most machine learning techniques, with the exception of kNN, are more computationally expensive than many statistical techniques. More complex algorithms may produce better imputation results but require higher computational costs. For classification problems, class center-based imputation methods have been developed and outperform other methods for numerical and mixed data types, but not for categorical data.

Many techniques for handling missing data ignore correlations between attributes, or are suitable only for categorical data. In fact, the performance of a missing value imputation algorithm is significantly affected by factors such as the correlation structure of the data. To estimate missing data while taking correlations and interrelationships between variables into account, an adaptive search procedure can be used. The Firefly Algorithm (FA) applies such an adaptive search procedure, imputing missing data by finding the estimate closest to the values in the other, known data. Data normalization and the handling of missing values are also major concerns in the pre-processing stage when classification algorithms are applied to numerical features, and if the data contains outliers, the estimated missing values may be unreliable or may even differ greatly from the true values. For categorical data, target encoding uses information from the target variable, but it risks overfitting and is inaccurate for categories that occur rarely in the data.

In this dissertation research, a class center-based method for handling missing data is proposed, modifying the search pattern of the Firefly Algorithm (FA) according to the attribute correlations of the data during the imputation process.
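To make the class center idea concrete, the following is a minimal sketch of plain class center-based imputation for numerical attributes, written in Python with pandas. The column and class names are hypothetical, and the sketch omits the correlation-based firefly search that the proposed method adds on top of this step.

    # Minimal sketch: fill each missing numeric value with the center
    # (mean) of the observed values of the same class. Hypothetical
    # column names; not the dissertation's exact C3-FA procedure.
    import numpy as np
    import pandas as pd

    def class_center_impute(df: pd.DataFrame, target: str) -> pd.DataFrame:
        out = df.copy()
        numeric_cols = out.select_dtypes(include="number").columns
        # Class centers: per-class mean of each numeric attribute,
        # computed over the observed (non-missing) values only.
        centers = out.groupby(target)[numeric_cols].transform("mean")
        out[numeric_cols] = out[numeric_cols].fillna(centers)
        return out

    # Toy example: one missing value per attribute.
    data = pd.DataFrame({
        "x1": [1.0, np.nan, 3.0, 4.0],
        "x2": [2.0, 2.5, np.nan, 4.5],
        "label": ["a", "a", "b", "b"],
    })
    print(class_center_impute(data, target="label"))

In this sketch the imputed value is simply the class center itself; in the dissertation's method, this class-based information is instead combined with an FA-based adaptive search guided by attribute correlations, as described above.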
In addition to the modified search pattern, the proposed method considers data normalization and the presence of outliers in the imputation process for numerical data, and applies smoothing target encoding before the imputation process for categorical data. Test results on several datasets show that the proposed method can reproduce the actual values in the data (predictive accuracy, PAC) and maintain the distribution of values of the missing data (distributional accuracy, DAC). The proposed method also produces a smaller root mean squared error (RMSE) than the SVM, kNNI, WRF, FkNNI, and CCMVI methods. Another contribution of this dissertation research is an analysis of the effect of outlier handling (O) and normalization (N) before the imputation process, where the proposed method, ON+C3-FA, outperforms the mean imputation, random imputation, linear regression, multiple imputation, and kNN imputation methods. For categorical datasets, the proposed C3FA-STD method produces better AUC, classification accuracy (CA), F1-score, precision, and recall values, outperforming mode imputation, the best method for categorical data in previous studies, as well as decision tree-based imputation.
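As an illustration of the encoding step mentioned for categorical data, the following is a minimal sketch of smoothed target encoding in Python. The smoothing weight m and the toy data are illustrative assumptions, not values taken from the dissertation.

    # Minimal sketch: encode each category as a blend of its own target
    # mean and the global target mean, so rare categories shrink toward
    # the global mean instead of overfitting. The weight m is illustrative.
    import pandas as pd

    def smooth_target_encode(df, col, target, m=10.0):
        global_mean = df[target].mean()
        stats = df.groupby(col)[target].agg(["mean", "count"])
        smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
        return df[col].map(smoothed)

    # Toy example: category "c" occurs once, so its encoding stays
    # close to the global mean despite its extreme observed target.
    data = pd.DataFrame({
        "cat": ["a", "a", "a", "b", "b", "c"],
        "y":   [1, 0, 1, 0, 0, 1],
    })
    data["cat_enc"] = smooth_target_encode(data, "cat", "y")
    print(data)

This shrinkage toward the global mean is what mitigates the rare-category inaccuracy of plain target encoding noted in the abstract, making the encoded values usable before the imputation step.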