Towards advanced distributed data processing: framework, optimization, and application

The surge in available big data has drawn significant interest in distributed processing methods capable of handling the ever-expanding data volume and increasing computational complexities efficiently and at scale. While existing distributed data processing frameworks, such as Apache Spark, have pr...

Bibliographic Details
Main Author: Liu, Kaiqi
Other Authors: Mo Li
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science
Online Access: https://hdl.handle.net/10356/177576
Institution: Nanyang Technological University
id sg-ntu-dr.10356-177576
record_format dspace
spelling sg-ntu-dr.10356-177576
2024-06-03T06:51:20Z
Towards advanced distributed data processing: framework, optimization, and application
Liu, Kaiqi
Mo Li
School of Computer Science and Engineering
Alibaba-NTU Singapore Joint Research Institute
limo@ntu.edu.sg
Computer and Information Science
Doctor of Philosophy
2024-05-29T04:45:33Z
2024
Thesis-Doctor of Philosophy
Liu, K. (2024). Towards advanced distributed data processing: framework, optimization, and application. Doctoral thesis, Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/177576
10.32657/10356/177576
en
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
application/pdf
Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Computer and Information Science
spellingShingle Computer and Information Science
Liu, Kaiqi
Towards advanced distributed data processing: framework, optimization, and application
description The surge in available big data has drawn significant interest in distributed processing methods capable of handling ever-expanding data volumes and increasing computational complexity efficiently and at scale. While existing distributed data processing frameworks, such as Apache Spark, have proven effective in various applications, there is still considerable room for improvement and exploration in this field. This thesis focuses on three key aspects of advancing distributed data processing using Apache Spark. First, a novel framework is introduced to extend Spark’s capabilities, enabling the efficient processing of large-scale spatio-temporal data to better serve machine-learning applications. This framework not only achieves high efficiency but also provides a user-friendly interface. Second, a deep-learning-based optimization approach tailored to enhance the efficiency of Spark SQL execution is proposed. The end-to-end system integration of this approach leads to practical performance gains. Finally, a distributed solution for computationally intensive, large-scale microscopic crowd simulation is designed and implemented, aiming to improve the scalability and efficiency of such applications. These three works collectively expand the application of distributed computing and enhance efficiency through the implementation of state-of-the-art techniques.
author2 Mo Li
author_facet Mo Li
Liu, Kaiqi
format Thesis-Doctor of Philosophy
author Liu, Kaiqi
author_sort Liu, Kaiqi
title Towards advanced distributed data processing: framework, optimization, and application
title_short Towards advanced distributed data processing: framework, optimization, and application
title_full Towards advanced distributed data processing: framework, optimization, and application
title_fullStr Towards advanced distributed data processing: framework, optimization, and application
title_full_unstemmed Towards advanced distributed data processing: framework, optimization, and application
title_sort towards advanced distributed data processing: framework, optimization, and application
publisher Nanyang Technological University
publishDate 2024
url https://hdl.handle.net/10356/177576
_version_ 1806059925243166720