Evolving large-scale data stream analytics based on scalable PANFIS
The main challenge in large-scale data stream analytics lies in the ability of machine learning to generate large-scale data knowledge in a reasonable timeframe without suffering from a loss of accuracy. Many distributed machine learning frameworks have recently been built to speed up the large-scale...
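Main Authors: | Za'in, Choiru; Pratama, Mahardhika; Pardede, Eric |
---|---|
Other Authors: | School of Computer Science and Engineering |
Format: | Article |
Language: | English |
Published: | 2021 |
Subjects: | Engineering::Computer science and engineering; Large-scale Data Stream Analytics; Distributed Data Stream Mining |
Online Access: | https://hdl.handle.net/10356/151672 |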
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-151672 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-151672 2021-07-14T07:16:22Z Evolving large-scale data stream analytics based on scalable PANFIS Za'in, Choiru Pratama, Mahardhika Pardede, Eric School of Computer Science and Engineering Engineering::Computer science and engineering Large-scale Data Stream Analytics Distributed Data Stream Mining The main challenge in large-scale data stream analytics lies in the ability of machine learning to generate large-scale data knowledge in a reasonable timeframe without suffering from a loss of accuracy. Many distributed machine learning frameworks have recently been built to speed up the large-scale data learning process. However, most distributed machine learning used in these frameworks still relies on an offline algorithm model, which cannot cope with data stream problems. In fact, large-scale data are mostly generated by non-stationary data streams whose patterns evolve over time. To address this problem, we propose a novel Evolving Large-scale Data Stream Analytics framework based on a Scalable Parsimonious Network based on Fuzzy Inference System (Scalable PANFIS), where the PANFIS evolving algorithm is distributed over the worker nodes in the cloud to learn from large-scale data streams. The Scalable PANFIS framework incorporates an active learning (AL) strategy and two model fusion methods. The AL strategy accelerates the distributed learning process to generate an initial evolving large-scale data stream model (initial model), whereas the two model fusion methods aggregate the initial model to generate the final model. The final model represents the update of current large-scale data knowledge, which can be used to infer future data. The framework is validated through extensive experiments that measure the accuracy and running time of four combinations of Scalable PANFIS and other Spark-based built-in algorithms. The results indicate that Scalable PANFIS with AL trains almost twice as fast as Scalable PANFIS without AL. The results also show that the rule merging and voting mechanisms yield similar accuracy in general among Scalable PANFIS algorithms, and both are generally better than Spark-based algorithms. In terms of running time, the Scalable PANFIS training time outperforms all Spark-based algorithms when classifying a multi-class label dataset. Ministry of Education (MOE) Nanyang Technological University This project is fully supported by NTU, Singapore start-up grant and MOE tier 1 research grant. This research is also supported by use of the Nectar Research Cloud, a collaborative Australian research platform supported by the National Collaborative Research Infrastructure Strategy (NCRIS). 2021-07-14T07:16:22Z 2021-07-14T07:16:22Z 2019 Journal Article Za'in, C., Pratama, M. & Pardede, E. (2019). Evolving large-scale data stream analytics based on scalable PANFIS. Knowledge-Based Systems, 166, 186-197. https://dx.doi.org/10.1016/j.knosys.2018.12.028 0950-7051 https://hdl.handle.net/10356/151672 10.1016/j.knosys.2018.12.028 2-s2.0-85059525095 166 186 197 en Knowledge-Based Systems © 2019 Elsevier B.V. All rights reserved. |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering; Large-scale Data Stream Analytics; Distributed Data Stream Mining |
description |
The main challenge in large-scale data stream analytics lies in the ability of machine learning to generate large-scale data knowledge in a reasonable timeframe without suffering from a loss of accuracy. Many distributed machine learning frameworks have recently been built to speed up the large-scale data learning process. However, most distributed machine learning used in these frameworks still relies on an offline algorithm model, which cannot cope with data stream problems. In fact, large-scale data are mostly generated by non-stationary data streams whose patterns evolve over time. To address this problem, we propose a novel Evolving Large-scale Data Stream Analytics framework based on a Scalable Parsimonious Network based on Fuzzy Inference System (Scalable PANFIS), where the PANFIS evolving algorithm is distributed over the worker nodes in the cloud to learn from large-scale data streams. The Scalable PANFIS framework incorporates an active learning (AL) strategy and two model fusion methods. The AL strategy accelerates the distributed learning process to generate an initial evolving large-scale data stream model (initial model), whereas the two model fusion methods aggregate the initial model to generate the final model. The final model represents the update of current large-scale data knowledge, which can be used to infer future data. The framework is validated through extensive experiments that measure the accuracy and running time of four combinations of Scalable PANFIS and other Spark-based built-in algorithms. The results indicate that Scalable PANFIS with AL trains almost twice as fast as Scalable PANFIS without AL. The results also show that the rule merging and voting mechanisms yield similar accuracy in general among Scalable PANFIS algorithms, and both are generally better than Spark-based algorithms. In terms of running time, the Scalable PANFIS training time outperforms all Spark-based algorithms when classifying a multi-class label dataset. |
author2 |
School of Computer Science and Engineering |
author_facet |
School of Computer Science and Engineering; Za'in, Choiru; Pratama, Mahardhika; Pardede, Eric |
format |
Article |
author |
Za'in, Choiru; Pratama, Mahardhika; Pardede, Eric |
author_sort |
Za'in, Choiru |
title |
Evolving large-scale data stream analytics based on scalable PANFIS |
publishDate |
2021 |
url |
https://hdl.handle.net/10356/151672 |
_version_ |
1707050442763010048 |
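
The description in this record names voting as one of the two model fusion methods used to combine models trained on separate worker nodes, but it does not specify how the votes are counted. As a rough illustration of the general idea only, the sketch below applies plain majority voting over per-worker classifiers. It is not the authors' implementation: the names `WorkerModel`, `majority_vote`, and `make_threshold_model` are hypothetical, and the threshold classifiers are stand-ins for models that would, in the paper's setting, be learned by PANFIS on each data partition.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

# Hypothetical type: a "worker model" is any function that maps one
# feature vector to a class label. Real worker models would be trained
# on separate data partitions; nothing here is taken from the paper.
WorkerModel = Callable[[Sequence[float]], Hashable]


def majority_vote(models: Sequence[WorkerModel], sample: Sequence[float]) -> Hashable:
    """Return the label predicted by the largest number of worker models.

    Ties are broken in favour of the label that was voted for first,
    which follows from the insertion ordering of Counter.most_common.
    """
    if not models:
        raise ValueError("at least one worker model is required")
    votes = Counter(model(sample) for model in models)
    return votes.most_common(1)[0][0]


def make_threshold_model(threshold: float) -> WorkerModel:
    """Build a toy stand-in classifier: 'high' if the feature mean exceeds threshold."""
    def predict(sample: Sequence[float]) -> str:
        return "high" if sum(sample) / len(sample) > threshold else "low"
    return predict


# Three toy worker models that disagree on borderline samples.
workers = [make_threshold_model(t) for t in (0.4, 0.5, 0.9)]
prediction = majority_vote(workers, [0.6, 0.8])
print(prediction)
```

Majority voting keeps every worker model intact and combines only their outputs, so it needs no knowledge of the models' internals. Rule merging, the other fusion method named in the description, would instead combine the models themselves, and is not sketched here because the record gives no detail on how rules are merged.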