Parallel graph processing on graphics processing units
Graphs are the de facto data structures for many applications, and efficient graph processing is essential for application performance. Recently, the GPU (Graphics Processing Unit) has been adopted to accelerate many specific graph processing algorithms such as BFS and shortest path. This is because GP...
Saved in:
Main Author: | Zhong, Jianlong |
---|---|
Other Authors: | He Bingsheng |
Format: | Theses and Dissertations |
Language: | English |
Published: | 2014 |
Subjects: | DRNTU::Engineering::Computer science and engineering |
Online Access: | https://hdl.handle.net/10356/59975 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-59975 |
---|---|
record_format |
dspace |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
DRNTU::Engineering::Computer science and engineering |
description |
Graphs are the de facto data structures for many applications, and efficient graph processing is essential for application performance. Recently, the GPU (Graphics Processing Unit) has been adopted to accelerate many specific graph processing algorithms such as BFS and shortest path, because GPUs offer an order of magnitude higher computational power and memory bandwidth than CPUs. However, the raw hardware power of the GPU does not automatically translate into high graph processing performance. Building parallel graph processing systems on GPUs raises challenges in both efficiency and programmability. Despite the recently improved programmability of GPUs, it remains difficult to write correct and efficient GPU programs, and even more so for graph processing because of the irregularity of graph structures. Worse, the GPU lacks an efficient method and runtime system to support concurrent graph processing tasks from multiple applications and/or users. To address these challenges, we develop the Medusa system to simplify parallel graph processing on the GPU and to support high-throughput execution of concurrent GPU tasks. This thesis presents the design, implementation and experimental evaluation of Medusa, followed by detailed case studies of Medusa in real-world graph applications.

To democratize GPU-accelerated graph processing, Medusa provides a programming framework that lets users harness the power of GPUs by writing sequential C code. Specifically, Medusa offers a small set of APIs with which developers define their application logic, and a runtime system that automatically executes the user-defined functions in parallel on GPUs (see the illustrative sketch at the end of this record). We further develop a series of graph-centric optimizations based on GPU architectural features to improve Medusa's efficiency. Beyond the optimized single-GPU system, we extend Medusa to execute on multiple GPUs within a machine and develop optimizations for computation and data transfer.

With the improved programmability of Medusa, users can implement their graph processing algorithms and submit multiple tasks to the GPU for concurrent execution, which calls for an optimized runtime system that manages concurrent GPU tasks. In practice, GPUs are usually deployed in shared environments such as clusters and clouds, where many tasks are submitted to GPUs by different users and throughput is an important metric for both performance and total cost of ownership. We therefore propose Kernelet as the runtime back-end system to optimize the throughput of concurrent kernel executions on the GPU. Kernelet provides transparent memory and PCIe data transfer management, as well as dynamic slicing and scheduling techniques, to support efficient concurrent kernel execution. With slicing, Kernelet divides a GPU kernel into multiple sub-kernels (called slices); each slice has tunable occupancy so that it can be co-scheduled with other slices for high GPU utilization (see the slicing sketch at the end of this record). We develop a novel Markov chain-based performance model to guide the scheduling decisions. Our experimental results demonstrate up to 31% and 23% performance improvement for our benchmark kernels on NVIDIA Tesla C2050 and GTX 680 GPUs, respectively, and Kernelet improves the throughput of concurrent Medusa applications by 29% on the C2050.

As a case study, we collaborate with our colleagues to adopt Medusa in a real research project: GPU-accelerated simulation of information propagation over large-scale, complex social networks. For a number of applications, Medusa serves as the base infrastructure, on top of which our colleagues develop additional application-specific optimizations. This case study demonstrates that Medusa's programmability improves coding productivity for domain-specific applications and that Medusa brings significant performance speedups by leveraging GPUs. |
author2 |
He Bingsheng |
format |
Theses and Dissertations |
author |
Zhong, Jianlong |
title |
Parallel graph processing on graphics processing units |
publishDate |
2014 |
url |
https://hdl.handle.net/10356/59975 |
school |
School of Computer Engineering |
degree |
DOCTOR OF PHILOSOPHY (SCE) |
type |
Thesis |
citation |
Zhong, J. (2014). Parallel graph processing on graphics processing units. Doctoral thesis, Nanyang Technological University, Singapore. |
doi |
10.32657/10356/59975 |
extent |
142 p. |
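The abstract describes Medusa as exposing a small set of APIs through which developers write sequential C functions, with a runtime that executes those user-defined functions in parallel on the GPU. The sketch below illustrates that general pattern only; the `Vertex` struct, `update_vertex`, and `run_vertex_phase` are hypothetical names and do not represent Medusa's actual API.

```cuda
// Illustrative sketch only -- NOT Medusa's actual API. A user writes a small
// sequential function; a runtime-provided kernel applies it to every vertex
// in parallel, hiding thread/block index arithmetic from the user.
#include <cuda_runtime.h>

struct Vertex {
    float value;   // e.g., a rank score or BFS level (hypothetical layout)
    int   degree;
};

// User-defined logic, written as ordinary sequential C code.
__device__ void update_vertex(Vertex &v, float incoming_sum) {
    v.value = 0.15f + 0.85f * incoming_sum;   // PageRank-style update
}

// Runtime-side kernel: maps one thread to one vertex and invokes the
// user function.
__global__ void run_vertex_phase(Vertex *vertices, const float *msg_sums, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) update_vertex(vertices[i], msg_sums[i]);
}

int main() {
    const int n = 1 << 20;
    Vertex *d_vertices = nullptr;
    float  *d_msg_sums = nullptr;
    cudaMalloc(&d_vertices, n * sizeof(Vertex));
    cudaMalloc(&d_msg_sums, n * sizeof(float));
    cudaMemset(d_vertices, 0, n * sizeof(Vertex));
    cudaMemset(d_msg_sums, 0, n * sizeof(float));

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    run_vertex_phase<<<blocks, threads>>>(d_vertices, d_msg_sums, n);
    cudaDeviceSynchronize();

    cudaFree(d_vertices);
    cudaFree(d_msg_sums);
    return 0;
}
```

The point of the pattern is that the developer only writes `update_vertex`; the framework owns kernel configuration, graph traversal, and message passing.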
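The abstract also describes Kernelet's slicing idea: a kernel's thread blocks are partitioned into slices, each launched as a smaller sub-kernel, so that slices from different kernels can be co-scheduled to keep the GPU busy. Below is a minimal sketch of that mechanism under assumed names (`saxpy_slice`, `block_offset`, `launch_sliced`); the real Kernelet applies slicing transparently and sizes slices using its Markov chain-based performance model.

```cuda
// Minimal sketch of kernel slicing (assumed names, not Kernelet's code):
// rather than launching all thread blocks at once, launch them in
// fixed-size slices on a CUDA stream, so slices of two different kernels
// can overlap on the GPU.
#include <algorithm>
#include <cuda_runtime.h>

__global__ void saxpy_slice(float *y, const float *x, float a, int n,
                            int block_offset) {
    // Recover the logical block index from this slice's offset.
    int logical_block = blockIdx.x + block_offset;
    int i = logical_block * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Issue `total_blocks` of work as slices of at most `slice_blocks` blocks.
// A smaller slice size lowers each slice's occupancy, leaving room for
// slices of another kernel on the same GPU.
void launch_sliced(float *y, const float *x, float a, int n,
                   int total_blocks, int slice_blocks, cudaStream_t stream) {
    for (int offset = 0; offset < total_blocks; offset += slice_blocks) {
        int blocks = std::min(slice_blocks, total_blocks - offset);
        saxpy_slice<<<blocks, 256, 0, stream>>>(y, x, a, n, offset);
    }
}
```

Co-scheduling then amounts to issuing slices of two kernels with complementary resource demands (for example, one compute-bound and one memory-bound) on separate streams; the Markov chain-based model mentioned in the abstract is what guides which kernels to pair and how large their slices should be.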