Clustering of large time-series datasets using a multi-step approach / Saeed Reza Aghabozorgi Sahaf Yazdi

Various data mining approaches are currently being used to analyse data within different domains. Among all these approaches, clustering is one of the most-used approaches, which is typically adopted in order to group data based on their similarities. The data in various systems such as finance, hea...

Bibliographic Details
Main Author: Yazdi, Saeed Reza Aghabozorgi Sahaf
Format: Thesis
Published: 2013
Subjects:
Online Access: http://studentsrepo.um.edu.my/5582/1/Saeed_Thesis.pdf
http://studentsrepo.um.edu.my/5582/
Institution: Universiti Malaya
id my.um.stud.5582
record_format eprints
spelling my.um.stud.5582 2015-06-15T02:30:07Z Clustering of large time-series datasets using a multi-step approach / Saeed Reza Aghabozorgi Sahaf Yazdi Yazdi, Saeed Reza Aghabozorgi Sahaf QA75 Electronic computers. Computer science T Technology (General) 2013 Thesis NonPeerReviewed application/pdf http://studentsrepo.um.edu.my/5582/1/Saeed_Thesis.pdf Yazdi, Saeed Reza Aghabozorgi Sahaf (2013) Clustering of large time-series datasets using a multi-step approach / Saeed Reza Aghabozorgi Sahaf Yazdi. PhD thesis, University of Malaya. http://studentsrepo.um.edu.my/5582/
institution Universiti Malaya
building UM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaya
content_source UM Student Repository
url_provider http://studentsrepo.um.edu.my/
topic QA75 Electronic computers. Computer science
T Technology (General)
spellingShingle QA75 Electronic computers. Computer science
T Technology (General)
Yazdi, Saeed Reza Aghabozorgi Sahaf
Clustering of large time-series datasets using a multi-step approach / Saeed Reza Aghabozorgi Sahaf Yazdi
description Various data mining approaches are used to analyse data in different domains. Among them, clustering is one of the most widely used and is typically adopted to group data by similarity. Data in systems such as finance, healthcare, and business are often stored as time-series, and clustering such complex data can reveal patterns that carry valuable information. Time-series clustering is useful not only as an exploratory technique but also as a subroutine in more complex data mining algorithms. As a result, time-series clustering (as part of temporal data mining research) has attracted increasing interest in areas such as medicine, biology, finance, economics, and the Web, and several studies have been conducted in these areas. Many of these studies focus on the time complexity of clustering large time-series datasets and address it with dimensionality reduction and conventional clustering algorithms. However, conventional clustering approaches are essentially designed for static data rather than time-series data, so applying them to time-series leads to poor clustering accuracy. Adequate clustering approaches for time-series are therefore lacking. This thesis addresses the low clustering quality of existing work by proposing a new multi-step clustering model. The model is designed specifically for very large time-series datasets, facilitates their accurate clustering, and overcomes the limitations of conventional clustering algorithms in dealing with time-series data. In the first step, the data are pre-processed, represented by symbolic aggregate approximation (SAX), and grouped approximately by a novel approach. In the second step, the groups are refined with an accurate clustering method, and a representative is defined for each cluster. Finally, the representatives are merged to construct the ultimate clusters. The model is then extended into an interactive model in which the results obtained by the user become more accurate over time. In this work, accurate clustering is performed on the basis of shape similarity. It is shown that clustering time-series does not require calculating the exact distance or similarity between all time-series in a dataset; instead, accurate clusters can be obtained by using prototypes of similar time-series. To evaluate its accuracy, the proposed model is tested extensively on published time-series datasets from diverse domains. The model is reported to be more accurate than existing work and is also scalable to large datasets because it uses time-series at multiple resolutions in the different levels of clustering. Moreover, it provides a clear understanding of the domains through its ability to generate hierarchical and arbitrarily shaped clusters of time-series data.
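The first step named in the description (pre-processing, symbolic aggregate approximation, approximate grouping) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the parameters `segments` and `alphabet_size`, and the choice to group series by identical SAX word are all assumptions made for illustration. The record does not specify the thesis's novel grouping approach, so exact-word bucketing is used here as a simple stand-in.

```python
from statistics import NormalDist, mean, pstdev


def znorm(series):
    """Pre-processing: rescale a series to zero mean and unit variance."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A constant series has no shape; map it to all zeros.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def paa(series, segments):
    """Piecewise aggregate approximation: mean of each equal-width segment."""
    if len(series) % segments != 0:
        raise ValueError("series length must be divisible by segments")
    width = len(series) // segments
    return [mean(series[i * width:(i + 1) * width]) for i in range(segments)]


def sax(series, segments, alphabet_size):
    """Convert a series to a SAX word over the letters 'a', 'b', ..."""
    # Breakpoints cut the standard normal distribution into
    # alphabet_size regions of equal probability.
    normal = NormalDist()
    breakpoints = [normal.inv_cdf(k / alphabet_size)
                   for k in range(1, alphabet_size)]
    word = []
    for value in paa(znorm(series), segments):
        index = sum(value > b for b in breakpoints)
        word.append(chr(ord("a") + index))
    return "".join(word)


def group_by_word(dataset, segments, alphabet_size):
    """Hypothetical approximate grouping: bucket series by identical SAX word."""
    groups = {}
    for position, series in enumerate(dataset):
        groups.setdefault(sax(series, segments, alphabet_size), []).append(position)
    return groups


rising = [1, 2, 3, 4, 5, 6, 7, 8]
rising_scaled = [10, 20, 30, 40, 50, 60, 70, 80]
falling = [8, 7, 6, 5, 4, 3, 2, 1]
print(group_by_word([rising, rising_scaled, falling], segments=4, alphabet_size=4))
```

Because each series is z-normalized before discretization, the two rising series share a SAX word despite their different scales, which matches the shape-based notion of similarity the description mentions. A coarse grouping of this kind is cheap to compute, and a more accurate method would then be needed to refine the groups, as the second step describes.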
format Thesis
author Yazdi, Saeed Reza Aghabozorgi Sahaf
author_facet Yazdi, Saeed Reza Aghabozorgi Sahaf
author_sort Yazdi, Saeed Reza Aghabozorgi Sahaf
title Clustering of large time-series datasets using a multi-step approach / Saeed Reza Aghabozorgi Sahaf Yazdi
title_short Clustering of large time-series datasets using a multi-step approach / Saeed Reza Aghabozorgi Sahaf Yazdi
title_full Clustering of large time-series datasets using a multi-step approach / Saeed Reza Aghabozorgi Sahaf Yazdi
title_fullStr Clustering of large time-series datasets using a multi-step approach / Saeed Reza Aghabozorgi Sahaf Yazdi
title_full_unstemmed Clustering of large time-series datasets using a multi-step approach / Saeed Reza Aghabozorgi Sahaf Yazdi
title_sort clustering of large time-series datasets using a multi-step approach / saeed reza aghabozorgi sahaf yazdi
publishDate 2013
url http://studentsrepo.um.edu.my/5582/1/Saeed_Thesis.pdf
http://studentsrepo.um.edu.my/5582/
_version_ 1738505808091545600