MAHALANOBIS DISTANCE BASED OUTLIER DETECTION FOR DATA STREAM

An outlier is a data point that is very different from the rest of the data. An outlier is also referred to as an abnormality, mismatch, deviation, or anomaly in data mining and statistical literature. Outlier detection is essential to improve the quality of datasets in the knowledge discovery pr...

Full description

Saved in:
Bibliographic Details
Main Author: Dani, Yasi
Format: Dissertations
Language:Indonesia
Online Access:https://digilib.itb.ac.id/gdl/view/87439
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Institut Teknologi Bandung
Language: Indonesia
Description
Summary:An outlier is a data point that is very different from the rest of the data. An outlier is also referred to as an abnormality, mismatch, deviation, or anomaly in data mining and statistical literature. Outlier detection is essential to improve the quality of datasets in the knowledge discovery process or in mathematical modeling. A data stream is a sequence of data points that are uncertain, dynamic, and infinite. Outlier detection in the data stream can be processed using offline learning and online learning. Offline learning is an approach in machine learning that processes all data at one time to build a model, while online learning is an approach in machine learning that updates the model incrementally. However, since the data stream is infinite, the entire data stream requires a lot of time and computation. Meanwhile, offline outlier detection in the data stream has the disadvantage of being time and computationally expensive. Therefore, outlier detection in the data stream is more suitable using the online learning approach. This study aims to develop an online outlier detection algorithm on data streams by applying a recursive algorithm that estimates an iterative formula to update parameters in the model when new data appears and detects outliers. Modification of multivariate analysis Principal Component Analysis (PCA) is chosen for outlier detection since the data has more than one variable where there is a correlation between variables, the sensitivity of the PCA method to outliers makes it possible for this method to detect extreme variations in the data, and the eigendecomposition of the data covariance matrix. Furthermore, since this parameter estimate is the result of the eigendecomposition of the covariance matrix, this study uses the Mahalanobis distance to calculate the outlier score. This study designs an outlier detection algorithm with Mahalanobis distance processed with offline and online learning approaches, where the offline algorithm is the baseline of the online algorithm. The difference is that the offline algorithm in this study is based on the classical PCA method to update its model parameters, while this online algorithm is based on the modified recursive PCA method to update its model parameters. Since another assumption of this study is that the changes in the covariance matrix of incoming data do not change drastically, so that the eigendecomposition is approximated by first-order perturbation analysis. To identify outliers, the Mahalanobis distance is used as the outlier score. Furthermore, the outlier detection algorithm is not only designed for the arrival of a single data point but is also developed for the arrival of mini-batch data. The type of outlier identified in the study is a global outliers (point anomalies). Simulations were performed on several synthetic datasets and a real video to simulate the offline and online outlier detection algorithms for both single-point data arrival types and mini-batch data arrival types. The results of this study’s simulations concluded that in synthetic datasets, the effectiveness of the online algorithm’s performance is the same as the offline algorithm for single-point data arrivals and the effectiveness of the online algorithm’s absolute difference is no greater than 0.03 with the offline algorithm for mini-batch data arrivals of more than one data point. Then, in real datasets, the effectiveness of the online algorithm’s absolute difference is no greater than 0.25 with the offline algorithm for both single-point data arrivals and mini-batch data, while the efficiency of the online algorithm’s performance is higher than the offline algorithm for both single-point data arrivals and mini-batch data. Furthermore, from both types of synthetic and real datasets, it can also be concluded that the efficiency of the outlier detection algorithm both offline and online for mini-batch data arrival types is higher than for single-point data arrival types, and the larger the mini-batch size, the higher the efficiency of the outlier detection algorithm.