Towards advanced distributed data processing: framework, optimization, and application
The surge in available big data has drawn significant interest in distributed processing methods capable of handling the ever-expanding data volume and increasing computational complexities efficiently and at scale. While existing distributed data processing frameworks, such as Apache Spark, have pr...
Main Author: | Liu, Kaiqi |
---|---|
Other Authors: | Mo Li |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Computer and Information Science |
Online Access: | https://hdl.handle.net/10356/177576 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-177576 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-177576 2024-06-03T06:51:20Z Towards advanced distributed data processing: framework, optimization, and application Liu, Kaiqi Mo Li School of Computer Science and Engineering Alibaba-NTU Singapore Joint Research Institute limo@ntu.edu.sg Computer and Information Science The surge in available big data has drawn significant interest in distributed processing methods capable of handling the ever-expanding data volume and increasing computational complexities efficiently and at scale. While existing distributed data processing frameworks, such as Apache Spark, have proven effective in various applications, there is still considerable room for improvement and exploration in this field. This thesis focuses on three key aspects of advancing distributed data processing using Apache Spark. First, a novel framework is introduced to extend Spark’s capabilities, enabling the efficient processing of large-scale spatio-temporal data to better serve machine-learning applications. This framework not only achieves high efficiency but also provides a user-friendly interface. Second, a deep-learning-based optimization approach tailored to enhance the efficiency of Spark SQL execution is proposed. The end-to-end system integration of this approach leads to practical performance gains. Last, a distributed solution for computationally intensive, large-scale microscopic crowd simulation is designed and implemented, aiming to improve the scalability and efficiency of such applications. These three works collectively expand the application of distributed computing and enhance efficiency through the implementation of state-of-the-art techniques. Doctor of Philosophy 2024-05-29T04:45:33Z 2024-05-29T04:45:33Z 2024 Thesis-Doctor of Philosophy Liu, K. (2024). Towards advanced distributed data processing: framework, optimization, and application. Doctoral thesis, Nanyang Technological University, Singapore. 
https://hdl.handle.net/10356/177576 https://hdl.handle.net/10356/177576 10.32657/10356/177576 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Computer and Information Science |
spellingShingle |
Computer and Information Science Liu, Kaiqi Towards advanced distributed data processing: framework, optimization, and application |
description |
The surge in available big data has drawn significant interest in distributed processing methods capable of handling the ever-expanding data volume and increasing computational complexities efficiently and at scale. While existing distributed data processing frameworks, such as Apache Spark, have proven effective in various applications, there is still considerable room for improvement and exploration in this field. This thesis focuses on three key aspects of advancing distributed data processing using Apache Spark. First, a novel framework is introduced to extend Spark’s capabilities, enabling the efficient processing of large-scale spatio-temporal data to better serve machine-learning applications. This framework not only achieves high efficiency but also provides a user-friendly interface. Second, a deep-learning-based optimization approach tailored to enhance the efficiency of Spark SQL execution is proposed. The end-to-end system integration of this approach leads to practical performance gains. Last, a distributed solution for computationally intensive, large-scale microscopic crowd simulation is designed and implemented, aiming to improve the scalability and efficiency of such applications. These three works collectively expand the application of distributed computing and enhance efficiency through the implementation of state-of-the-art techniques. |
author2 |
Mo Li |
author_facet |
Mo Li Liu, Kaiqi |
format |
Thesis-Doctor of Philosophy |
author |
Liu, Kaiqi |
author_sort |
Liu, Kaiqi |
title |
Towards advanced distributed data processing: framework, optimization, and application |
publisher |
Nanyang Technological University |
publishDate |
2024 |
url |
https://hdl.handle.net/10356/177576 |