In-memory analytical query processing on GPUs
| Main Author: | Paul, Johns |
|---|---|
| Other Authors: | Lau Chiew Tong |
| Format: | Thesis-Doctor of Philosophy |
| Language: | English |
| Published: | Nanyang Technological University, 2021 |
| Subjects: | Engineering::Computer science and engineering::Computer systems organization::Performance of systems; Engineering::Computer science and engineering::Computer systems organization::Computer system implementation |
| Online Access: | https://hdl.handle.net/10356/146888 |
| Institution: | Nanyang Technological University |
Description:
The high global memory bandwidth and the large number of parallel cores available in modern Graphics Processing Units (GPUs) make them ideal for high-performance Online Analytical Processing (OLAP) systems. However, designing efficient, high-performance query processing systems for GPUs is challenging for several reasons: 1) the rapid evolution of GPU hardware over the last decade, 2) the significant differences between GPU and CPU hardware architectures, 3) the high overhead of moving data between the CPU and the GPU, and 4) the small global memory of a single GPU, which necessitates accessing remote data over PCIe or NVLink interconnects when processing large data sets. In this thesis, we study existing query processing systems for GPUs and propose techniques to improve query execution efficiency on GPUs.
We begin by studying the performance of the hash join, one of the most compute- and memory-intensive relational operators in OLAP systems. Specifically, we first revisit the major single-GPU hash join implementations of the past decade. We then detail how these implementations exploit different GPU architecture features and conduct a comprehensive evaluation of their performance and cost-efficiency across different generations of GPUs. This sheds light on how individual GPU architecture features affect hash join performance and identifies the factors that should guide the choice of these features. We further study how data characteristics such as skew and match rate impact the performance of GPU hash join implementations, and we propose novel techniques to improve hash join performance when joining input relations with severe skew or a high match rate. Our evaluation finds that the proposed techniques avoid performance degradation when joining skewed input relations and achieve up to 2.5x better performance when joining input relations with a high match rate.
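The build/probe structure of the hash join discussed above can be sketched in plain Python. This is a simplified, single-threaded analogue of the GPU implementations the thesis surveys, not any specific implementation from it; the relation contents are illustrative.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows):
    """Join two lists of (key, payload) tuples on key."""
    # Build phase: hash the (usually smaller) build relation into buckets.
    table = defaultdict(list)
    for key, payload in build_rows:
        table[key].append(payload)
    # Probe phase: look up each probe tuple. A high match rate means each
    # probe key hits many build tuples, inflating the inner loop and the
    # output size -- one of the data characteristics studied above.
    out = []
    for key, payload in probe_rows:
        for build_payload in table[key]:
            out.append((key, build_payload, payload))
    return out
```

Note how skew maps onto this sketch: a heavily repeated build key produces one very long bucket, so probes of that key do far more work than others — on a GPU this imbalance across threads is what degrades performance.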
Next, we extend our study of the hash join operation to modern multi-GPU architectures. The recent scale-up of GPU hardware through the integration of multiple GPUs into a single machine, together with the introduction of higher-bandwidth interconnects such as NVLink 2.0, has created new opportunities for relational query processing on multiple GPUs. However, due to the unique characteristics of GPUs and their interconnects, existing hash join implementations spend up to 66% of their execution time moving data between the GPUs and achieve less than 50% utilization of the newer high-bandwidth interconnects. The result is extremely poor scalability: hash join performance on multiple GPUs can be slower than on a single GPU. In this thesis, we propose MG-Join, a scalable partitioned hash join implementation for the multiple GPUs of a single machine. To improve bandwidth utilization, we develop a novel multi-hop routing scheme for cross-GPU communication that adaptively chooses an efficient route for each data flow to minimize congestion. Our experiments on a DGX-1 machine show that MG-Join significantly reduces communication overhead and achieves up to 97% utilization of the bisection bandwidth of the interconnects, resulting in significantly better scalability. Overall, MG-Join outperforms state-of-the-art hash join implementations by up to 2.5x, and it further improves the overall performance of TPC-H queries by up to 4.5x over the multi-GPU version of OmniSci, an open-source GPU database.
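The core idea behind any partitioned hash join — routing tuples with equal keys to the same partition (in the multi-GPU case, the same GPU) so each partition can then be joined locally — can be illustrated with a toy radix-partitioning sketch. This is a minimal single-process illustration of the general technique, not MG-Join's actual routing or partitioning code; the partition count and keys are assumptions.

```python
def radix_partition(rows, num_parts):
    """Split (key, payload) tuples by the low-order bits of the key, so
    tuples with equal keys always land in the same partition. In a
    multi-GPU join, each partition would be joined locally on one GPU,
    and only the partitioning shuffle crosses the interconnects."""
    assert num_parts & (num_parts - 1) == 0, "num_parts must be a power of two"
    mask = num_parts - 1
    parts = [[] for _ in range(num_parts)]
    for key, payload in rows:
        parts[key & mask].append((key, payload))
    return parts
```

In this scheme the cross-GPU shuffle of partitions is the dominant cost, which is why MG-Join's contribution focuses on routing that traffic to maximize interconnect utilization.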
Finally, we focus on optimizing the execution of entire queries on GPUs. Specifically, we look at improving the execution efficiency of just-in-time (JIT) compilation-based query processing systems on GPUs. We observe that JIT-compiled query processing suffers severe under-utilization of GPU hardware due to divergent execution and degraded parallelism arising from resource contention. To address these issues, we propose Pyper, a JIT compilation-based query engine that reduces divergent execution and improves hardware utilization during query execution on the GPU. Specifically, Pyper introduces two new operators for query plan transformation, Shuffle and Segment, which can be plugged into a physical query plan to reduce divergent execution and resolve resource contention, respectively. To determine the insertion points for these two operators, we present an analytical model that places Shuffle and Segment operators in a query plan in a cost-based manner. Our experiments show that 1) analytically modeling divergent execution and resource contention improves the accuracy of the cost model, and 2) Pyper improves the performance of TPC-H and SSB queries by up to 5.01x and 3.06x on average when compared to OmniSci and Hawk, respectively.
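The general stream-compaction idea underlying a Shuffle-style operator — densifying the set of active rows so that threads in a GPU warp do not diverge on a filter — can be sketched as follows. This is a toy conceptual illustration, not Pyper's actual operator or API; the function name is an assumption.

```python
def compact_indices(flags):
    """Stream compaction: turn a per-row selection bitmap into a dense
    list of surviving row indices. On a GPU, this replaces the divergent
    pattern 'if flags[i]: expensive_work(i)' -- where threads holding
    non-matching rows sit idle while their warp-mates work -- with
    uniform work over a shorter, dense index array."""
    return [i for i, f in enumerate(flags) if f]
```

After compaction, a downstream operator iterates only over the dense index list, so every thread does useful work; the trade-off is the extra pass and memory traffic of the compaction itself, which is why a cost model is needed to decide where such an operator pays off.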
Record details:

| Record ID: | sg-ntu-dr.10356-146888 |
|---|---|
| Author: | Paul, Johns |
| Other Author: | Lau Chiew Tong |
| School: | School of Computer Science and Engineering; Parallel and Distributed Computing Centre |
| Degree: | Doctor of Philosophy |
| Date deposited: | 2021-03-12 |
| Citation: | Paul, J. (2021). In-memory analytical query processing on GPUs. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/146888 |
| DOI: | 10.32657/10356/146888 |
| License: | Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) |
| Publisher: | Nanyang Technological University |