Parallel graph processing on graphics processing units

Graphs are de facto data structures for many applications, and efficient graph processing is a must for the application performance. Recently, the GPU (Graphics Processing Unit) has been adopted to accelerate many specific graph processing algorithms such as BFS and shortest path. This is because GP...

Full description

Saved in:
Bibliographic Details
Main Author: Zhong, Jianlong
Other Authors: He Bingsheng
Format: Theses and Dissertations
Language:English
Published: 2014
Subjects:
Online Access:https://hdl.handle.net/10356/59975
Institution: Nanyang Technological University
id sg-ntu-dr.10356-59975
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering
spellingShingle DRNTU::Engineering::Computer science and engineering
Zhong, Jianlong
Parallel graph processing on graphics processing units
description Graphs are the de facto data structures for many applications, and efficient graph processing is essential for application performance. Recently, GPUs (Graphics Processing Units) have been adopted to accelerate many graph processing algorithms such as BFS and shortest path, because GPUs offer an order of magnitude higher computational power and memory bandwidth than CPUs. However, the GPU's raw hardware power does not automatically translate into fast graph processing: building parallel graph processing systems on GPUs poses challenges in both efficiency and programmability. Despite the recently improved programmability of GPUs, it is difficult to write correct and efficient GPU programs, and even more difficult for graph processing due to the irregularity of graph structures. Worse still, GPUs lack an efficient method and runtime system for supporting concurrent graph processing tasks from multiple applications and/or users. To address these challenges, we develop the Medusa system to simplify parallel graph processing on the GPU and to support high-throughput execution of concurrent GPU tasks. This thesis presents the design, implementation and experimental evaluation of Medusa, followed by detailed case studies of Medusa in real-world graph applications. To democratize GPU-accelerated graph processing, Medusa provides a programming framework that enables users to harness the power of GPUs by writing sequential C code. Specifically, Medusa offers a small set of APIs for developers to define their application logic, and a runtime system that automatically executes the user-defined functions in parallel on GPUs. We further develop a series of graph-centric optimizations, based on the architectural features of GPUs, to improve the efficiency of Medusa.
With the system optimized on a single GPU, we further extend it to execute on multiple GPUs within a machine and develop optimizations for computation and data transfer. With the improved programmability of Medusa, users can implement their graph processing algorithms and submit multiple tasks to the GPU for concurrent execution, which requires an optimized runtime system for managing concurrent GPU tasks. In practice, GPUs are usually deployed in shared environments such as clusters and clouds, where many tasks are submitted to GPUs by different users and throughput is an important metric for both performance and total cost of ownership. We therefore propose Kernelet as the runtime back-end system to optimize the throughput of concurrent kernel executions on the GPU. Kernelet combines transparent memory and PCIe data transfer management with dynamic slicing and scheduling techniques to support efficient concurrent kernel execution. With slicing, Kernelet divides a GPU kernel into multiple sub-kernels (called slices); each slice has tunable occupancy, allowing it to be co-scheduled with other slices for high GPU utilization. We develop a novel Markov chain-based performance model to guide the scheduling decisions. Our experimental results demonstrate up to 31% and 23% performance improvement for our benchmark kernels on NVIDIA Tesla C2050 and GTX680 GPUs, respectively, and Kernelet improves the throughput of concurrent Medusa applications by 29% on the C2050. As a case study, we collaborate with our colleagues to adopt Medusa in a real research project: GPU-accelerated simulation of information propagation over large-scale, complex social networks. For a number of applications, Medusa serves as the base infrastructure on which our colleagues develop additional application-specific optimizations for further performance improvement.
From this case study, we demonstrate that the programmability of Medusa improves coding productivity for domain-specific applications, and that Medusa delivers significant performance speedups by leveraging GPUs.
author2 He Bingsheng
author_facet He Bingsheng
Zhong, Jianlong
format Theses and Dissertations
author Zhong, Jianlong
author_sort Zhong, Jianlong
title Parallel graph processing on graphics processing units
title_short Parallel graph processing on graphics processing units
title_full Parallel graph processing on graphics processing units
title_fullStr Parallel graph processing on graphics processing units
title_full_unstemmed Parallel graph processing on graphics processing units
title_sort parallel graph processing on graphics processing units
publishDate 2014
url https://hdl.handle.net/10356/59975
_version_ 1759853038868103168
spelling sg-ntu-dr.10356-599752023-03-04T00:42:07Z Parallel graph processing on graphics processing units Zhong, Jianlong He Bingsheng School of Computer Engineering DRNTU::Engineering::Computer science and engineering DOCTOR OF PHILOSOPHY (SCE) 2014-05-21T05:59:11Z 2014-05-21T05:59:11Z 2014 2014 Thesis Zhong, J. (2014). Parallel graph processing on graphics processing units. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/59975 10.32657/10356/59975 en 142 p. application/pdf