Advanced classification for streaming time series and data streams
Nowadays, overwhelming volumes of sequential data are very common in scientific and business applications, such as biomedicine, stock markets, retail industry, and communication networks. Time series and data streams are the two most popular types of sequential data. The main difference between them...
Main Author: | Nguyen, Hai Long |
---|---|
Other Authors: | Ng Wee Keong |
Format: | Theses and Dissertations |
Language: | English |
Published: | 2013 |
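Subjects: | DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications |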
Online Access: | https://hdl.handle.net/10356/54815 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-54815 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-54815 2023-03-04T00:47:30Z Advanced classification for streaming time series and data streams Nguyen, Hai Long Ng Wee Keong School of Computer Engineering EADS Innovation Works South Asia Economic Development Board of Singapore Centre for Advanced Information Systems Woon Yew Kwong DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications DOCTOR OF PHILOSOPHY (SCE) 2013-08-27T09:14:07Z 2013-08-27T09:14:07Z 2012 2012 Thesis Nguyen, H. L. (2012). Advanced classification for streaming time series and data streams. Doctoral thesis, Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/54815 10.32657/10356/54815 en 167 p. application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications |
description |
Nowadays, overwhelming volumes of sequential data are very common in scientific and business applications, such as biomedicine, stock markets, the retail industry, and communication networks. Time series and data streams are the two most popular types of sequential data. The main difference between them is that a time series is defined on a single-variable domain, while data streams are generally defined on a multivariate domain. However, they share some characteristics: possibly infinite volume, time ordering, and dynamic change. In this dissertation, we propose classification algorithms for time series and data streams that satisfy strict constraints, such as bounded memory, single pass, real-time response, and concept-drift detection. Here, a concept drift refers to the situation where the data's underlying distribution changes over time.
For massive time series datasets, classification algorithms based on motifs (frequent subsequences) are preferable, since they not only have low complexity but can also achieve high accuracy. However, state-of-the-art algorithms can only find motifs of a predefined length, which greatly affects their performance and practicality. To overcome this challenge, we introduce the notion of a closed motif: a motif is closed if no longer motif has the same number of occurrences. We also propose a novel closed-motif-based classifier, which is lightweight, effective, and efficient for time series classification.
Furthermore, we examine the more challenging problem of classifying data streams in a multivariate domain. Here, we are confronted with the feature drift problem, where the importance or relevance of a set of features changes over time. We propose a general framework that integrates feature selection and heterogeneous ensemble learning; it is able to adapt to different types of concept drift and works well with various kinds of datasets. The ensemble consists of well-chosen online classifiers and is equipped with an optimal weighting method. It updates its online classifier members for gradual drifts, and replaces outdated members with new ones for feature drifts.
Additionally, we extend our algorithms to a practical environment, where labeled data is very scarce and data streams must be mined concurrently in order to make full use of the single-pass data. Conventional stream mining algorithms focus only on stand-alone mining tasks. Therefore, we propose an incremental algorithm that performs clustering and classification concurrently, which not only maximizes throughput but also achieves better mining results. Moreover, enhanced with a novel active learning technique, our algorithm requires only a small number of queries to work well with very sparsely labeled data streams.
Finally, as the volume of sequential data grows steadily, a single computer with limited computing power may soon be insufficient for the mining processes. Cloud computing, a cutting-edge technology that provides elastic computing on demand, will certainly facilitate large sequential data mining. Therefore, we plan to adapt and migrate our algorithms to a cloud computing platform in the future. |
author2 |
Ng Wee Keong |
author_facet |
Ng Wee Keong Nguyen, Hai Long |
format |
Theses and Dissertations |
author |
Nguyen, Hai Long |
author_sort |
Nguyen, Hai Long |
title |
Advanced classification for streaming time series and data streams |
publishDate |
2013 |
url |
https://hdl.handle.net/10356/54815 |
_version_ |
1759853038327037952 |