Evolving large-scale data stream analytics based on scalable PANFIS

The main challenge in large-scale data stream analytics lies in the ability of machine learning to generate large-scale data knowledge in a reasonable timeframe without suffering from a loss of accuracy. Many distributed machine learning frameworks have recently been built to speed up the large-scale...

Bibliographic Details
Main Authors: Za'in, Choiru, Pratama, Mahardhika, Pardede, Eric
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2021
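Subjects: Engineering::Computer science and engineering; Large-scale Data Stream Analytics; Distributed Data Stream Mining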
Online Access: https://hdl.handle.net/10356/151672
Institution: Nanyang Technological University
id sg-ntu-dr.10356-151672
record_format dspace
spelling sg-ntu-dr.10356-151672
2021-07-14T07:16:22Z
Evolving large-scale data stream analytics based on scalable PANFIS
Za'in, Choiru
Pratama, Mahardhika
Pardede, Eric
School of Computer Science and Engineering
Engineering::Computer science and engineering
Large-scale Data Stream Analytics
Distributed Data Stream Mining
The main challenge in large-scale data stream analytics lies in the ability of machine learning to generate large-scale data knowledge in a reasonable timeframe without suffering from a loss of accuracy. Many distributed machine learning frameworks have recently been built to speed up the large-scale data learning process. However, most distributed machine learning algorithms used in these frameworks still rely on an offline model that cannot cope with data stream problems. In fact, large-scale data are mostly generated by non-stationary data streams whose patterns evolve over time. To address this problem, we propose a novel Evolving Large-scale Data Stream Analytics framework based on a Scalable Parsimonious Network based on Fuzzy Inference System (Scalable PANFIS), where the PANFIS evolving algorithm is distributed over the worker nodes in the cloud to learn large-scale data streams. The Scalable PANFIS framework incorporates an active learning (AL) strategy and two model fusion methods. The AL strategy accelerates the distributed learning process to generate an initial evolving large-scale data stream model (initial model), whereas the two model fusion methods aggregate the initial model to generate the final model. The final model represents the update of current large-scale data knowledge, which can be used to infer future data. The framework is validated through extensive experiments that measure the accuracy and running time of four combinations of Scalable PANFIS and other Spark-based built-in algorithms. The results indicate that Scalable PANFIS with AL trains almost two times faster than Scalable PANFIS without AL. The results also show that the rule merging and voting mechanisms yield similar accuracy in general among Scalable PANFIS algorithms, and that they are generally better than Spark-based algorithms. In terms of running time, the Scalable PANFIS training time outperforms all Spark-based algorithms when classifying a multi-class label dataset.
Ministry of Education (MOE)
Nanyang Technological University
This project is fully supported by an NTU, Singapore start-up grant and an MOE Tier 1 research grant. This research is also supported by use of the Nectar Research Cloud, a collaborative Australian research platform supported by the National Collaborative Research Infrastructure Strategy (NCRIS).
2021-07-14T07:16:22Z
2021-07-14T07:16:22Z
2019
Journal Article
Za'in, C., Pratama, M. & Pardede, E. (2019). Evolving large-scale data stream analytics based on scalable PANFIS. Knowledge-Based Systems, 166, 186-197. https://dx.doi.org/10.1016/j.knosys.2018.12.028
0950-7051
https://hdl.handle.net/10356/151672
10.1016/j.knosys.2018.12.028
2-s2.0-85059525095
166
186
197
en
Knowledge-Based Systems
© 2019 Elsevier B.V. All rights reserved.
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
Large-scale Data Stream Analytics
Distributed Data Stream Mining
spellingShingle Engineering::Computer science and engineering
Large-scale Data Stream Analytics
Distributed Data Stream Mining
Za'in, Choiru
Pratama, Mahardhika
Pardede, Eric
Evolving large-scale data stream analytics based on scalable PANFIS
description The main challenge in large-scale data stream analytics lies in the ability of machine learning to generate large-scale data knowledge in a reasonable timeframe without suffering from a loss of accuracy. Many distributed machine learning frameworks have recently been built to speed up the large-scale data learning process. However, most distributed machine learning algorithms used in these frameworks still rely on an offline model that cannot cope with data stream problems. In fact, large-scale data are mostly generated by non-stationary data streams whose patterns evolve over time. To address this problem, we propose a novel Evolving Large-scale Data Stream Analytics framework based on a Scalable Parsimonious Network based on Fuzzy Inference System (Scalable PANFIS), where the PANFIS evolving algorithm is distributed over the worker nodes in the cloud to learn large-scale data streams. The Scalable PANFIS framework incorporates an active learning (AL) strategy and two model fusion methods. The AL strategy accelerates the distributed learning process to generate an initial evolving large-scale data stream model (initial model), whereas the two model fusion methods aggregate the initial model to generate the final model. The final model represents the update of current large-scale data knowledge, which can be used to infer future data. The framework is validated through extensive experiments that measure the accuracy and running time of four combinations of Scalable PANFIS and other Spark-based built-in algorithms. The results indicate that Scalable PANFIS with AL trains almost two times faster than Scalable PANFIS without AL. The results also show that the rule merging and voting mechanisms yield similar accuracy in general among Scalable PANFIS algorithms, and that they are generally better than Spark-based algorithms. In terms of running time, the Scalable PANFIS training time outperforms all Spark-based algorithms when classifying a multi-class label dataset.
author2 School of Computer Science and Engineering
author_facet School of Computer Science and Engineering
Za'in, Choiru
Pratama, Mahardhika
Pardede, Eric
format Article
author Za'in, Choiru
Pratama, Mahardhika
Pardede, Eric
author_sort Za'in, Choiru
title Evolving large-scale data stream analytics based on scalable PANFIS
title_short Evolving large-scale data stream analytics based on scalable PANFIS
title_full Evolving large-scale data stream analytics based on scalable PANFIS
title_fullStr Evolving large-scale data stream analytics based on scalable PANFIS
title_full_unstemmed Evolving large-scale data stream analytics based on scalable PANFIS
title_sort evolving large-scale data stream analytics based on scalable panfis
publishDate 2021
url https://hdl.handle.net/10356/151672
_version_ 1707050442763010048