Towards advanced distributed data processing: framework, optimization, and application

Bibliographic Details
Main Author: Liu, Kaiqi
Other Authors: Mo Li
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Online Access: https://hdl.handle.net/10356/177576
Institution: Nanyang Technological University
Description
Summary: The surge in available big data has drawn significant interest in distributed processing methods that can handle ever-expanding data volumes and increasing computational complexity efficiently and at scale. While existing distributed data processing frameworks, such as Apache Spark, have proven effective in various applications, there is still considerable room for improvement and exploration in this field. This thesis focuses on three key aspects of advancing distributed data processing using Apache Spark. First, a novel framework is introduced to extend Spark’s capabilities, enabling the efficient processing of large-scale spatio-temporal data to better serve machine-learning applications. This framework not only achieves high efficiency but also provides a user-friendly interface. Second, a deep-learning-based optimization approach tailored to enhance the efficiency of Spark SQL execution is proposed. The end-to-end system integration of this approach leads to practical performance gains. Last, a distributed solution for computationally intensive, large-scale microscopic crowd simulation is designed and implemented, aiming to improve the scalability and efficiency of such applications. Together, these three works expand the application of distributed computing and enhance efficiency through the implementation of state-of-the-art techniques.