MAHALANOBIS DISTANCE BASED OUTLIER DETECTION FOR DATA STREAM
An outlier is a data point that is very different from the rest of the data. An outlier is also referred to as an abnormality, mismatch, deviation, or anomaly in data mining and statistical literature. Outlier detection is essential to improve the quality of datasets in the knowledge discovery pr...
Saved in:
Main Author: | |
---|---|
Format: | Dissertations |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/87439 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | Institut Teknologi Bandung |
Language: | Indonesia |
Summary: | An outlier is a data point that is very different from the rest of the data. An outlier is
also referred to as an abnormality, mismatch, deviation, or anomaly in data mining
and statistical literature. Outlier detection is essential to improve the quality of
datasets in the knowledge discovery process or in mathematical modeling. A data
stream is a sequence of data points that are uncertain, dynamic, and infinite. Outlier
detection in the data stream can be processed using offline learning and online
learning. Offline learning is an approach in machine learning that processes all
data at one time to build a model, while online learning is an approach in machine
learning that updates the model incrementally. However, since the data stream is
infinite, the entire data stream requires a lot of time and computation. Meanwhile,
offline outlier detection in the data stream has the disadvantage of being time and
computationally expensive. Therefore, outlier detection in the data stream is more
suitable using the online learning approach.
This study aims to develop an online outlier detection algorithm on data streams by
applying a recursive algorithm that estimates an iterative formula to update parameters
in the model when new data appears and detects outliers. Modification of
multivariate analysis Principal Component Analysis (PCA) is chosen for outlier
detection since the data has more than one variable where there is a correlation
between variables, the sensitivity of the PCA method to outliers makes it possible
for this method to detect extreme variations in the data, and the eigendecomposition
of the data covariance matrix. Furthermore, since this parameter estimate is
the result of the eigendecomposition of the covariance matrix, this study uses the
Mahalanobis distance to calculate the outlier score.
This study designs an outlier detection algorithm with Mahalanobis distance
processed with offline and online learning approaches, where the offline algorithm
is the baseline of the online algorithm. The difference is that the offline algorithm
in this study is based on the classical PCA method to update its model parameters,
while this online algorithm is based on the modified recursive PCA method
to update its model parameters. Since another assumption of this study is that the
changes in the covariance matrix of incoming data do not change drastically, so that
the eigendecomposition is approximated by first-order perturbation analysis. To
identify outliers, the Mahalanobis distance is used as the outlier score. Furthermore, the outlier detection algorithm is not only designed for the arrival of a single data
point but is also developed for the arrival of mini-batch data. The type of outlier
identified in the study is a global outliers (point anomalies).
Simulations were performed on several synthetic datasets and a real video to
simulate the offline and online outlier detection algorithms for both single-point data
arrival types and mini-batch data arrival types. The results of this study’s simulations
concluded that in synthetic datasets, the effectiveness of the online algorithm’s
performance is the same as the offline algorithm for single-point data arrivals and
the effectiveness of the online algorithm’s absolute difference is no greater than
0.03 with the offline algorithm for mini-batch data arrivals of more than one data
point. Then, in real datasets, the effectiveness of the online algorithm’s absolute
difference is no greater than 0.25 with the offline algorithm for both single-point
data arrivals and mini-batch data, while the efficiency of the online algorithm’s
performance is higher than the offline algorithm for both single-point data arrivals
and mini-batch data. Furthermore, from both types of synthetic and real datasets,
it can also be concluded that the efficiency of the outlier detection algorithm both
offline and online for mini-batch data arrival types is higher than for single-point
data arrival types, and the larger the mini-batch size, the higher the efficiency of the
outlier detection algorithm. |
---|