A spark-based parallel fuzzy C median algorithm for web log big data

Bibliographic Details
Main Authors: Mallik, Moksud Alam; Zulkurnain, Nurul Fariza; Nizamuddin, Mohammed Khaja; Sarkar, Rashal; Chalil, Aboosalih Kakkat
Format: Article
Language: English
Published: International Organization of IOTPE 2022
Online Access: http://irep.iium.edu.my/102189/7/102189_A%20spark-based%20parallel%20fuzzy%20C_SCOPUS.pdf
http://irep.iium.edu.my/102189/8/102189_A%20spark-based%20parallel%20fuzzy%20C.pdf
http://irep.iium.edu.my/102189/
https://www.iotpe.com/IJTPE/IJTPE-2022/IJTPE-Issue52-Vol14-No3-Sep2022/29-IJTPE-Issue52-Vol14-No3-Sep2022-pp212-220.pdf
Institution: Universiti Islam Antarabangsa Malaysia
Description
Summary: Nowadays, the World Wide Web (WWW) is regarded as an exceptionally large data storehouse, and it is becoming more complicated and substantive every day. At the moment, we are starved for knowledge while drowning in data. For these reasons, data mining clustering is one of the most crucial tools for collecting useful data from the web. Numerous successful clustering techniques have been developed for small datasets. Nevertheless, these techniques do not provide adequate results when dealing with extensive datasets. The most important problems are excessive computational complexity and lengthy evaluation time, which are not acceptable in a real-time context. It is essential to process this enormous volume of information in a timely manner. This paper proposes an efficient parallel Fuzzy C median solution based on Spark for large-scale web log data. The parallel Fuzzy C median algorithm's performance is evaluated on the PySpark platform using the Rand Index and SSE (sum of squared errors). According to the experimental findings, the parallel Fuzzy C median method built on Spark performs better.
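
For readers unfamiliar with the two evaluation measures named in the summary, the following is a minimal sketch of how a Rand Index and an SSE could be computed. It is not taken from the paper: the function names `rand_index` and `sse` are hypothetical, and the sketch assumes hard cluster labels (for example, each point assigned to its highest-membership cluster) and squared Euclidean distance, since this record does not give the paper's exact formulation.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which two labelings agree.

    A pair agrees when both labelings put the two points in the same
    cluster, or both put them in different clusters.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("need at least two points")
    agree = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)


def sse(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its center.

    Assumption: hard assignments; labels[k] indexes into centers.
    """
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total
```

A Rand Index closer to 1 indicates closer agreement with the reference labeling, and a lower SSE indicates tighter clusters. The pairwise loop is quadratic in the number of points, so on data at the scale the summary describes it would only be practical on a sample.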