Parallel graph processing on graphics processing units

Graphs are de facto data structures for many applications, and efficient graph processing is a must for the application performance. Recently, the GPU (Graphics Processing Unit) has been adopted to accelerate many specific graph processing algorithms such as BFS and shortest path. This is because GP...

Full description

Saved in:
Bibliographic Details
Main Author: Zhong, Jianlong
Other Authors: He Bingsheng
Format: Theses and Dissertations
Language:English
Published: 2014
Subjects:
Online Access:https://hdl.handle.net/10356/59975
Institution: Nanyang Technological University
id sg-ntu-dr.10356-59975
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering
spellingShingle DRNTU::Engineering::Computer science and engineering
Zhong, Jianlong
Parallel graph processing on graphics processing units
description Graphs are the de facto data structures for many applications, and efficient graph processing is essential for application performance. Recently, GPUs (Graphics Processing Units) have been adopted to accelerate many graph processing algorithms such as BFS and shortest path, because GPUs offer an order of magnitude higher computational power and memory bandwidth than CPUs. However, the GPU's raw hardware power does not automatically translate into fast graph processing: building parallel graph processing systems on GPUs poses challenges in both efficiency and programmability. Despite the recently improved programmability of GPUs, it is difficult to write correct and efficient GPU programs, and even more difficult for graph processing due to the irregularity of graph structures. Worse still, GPUs lack an efficient method and runtime system for supporting concurrent graph processing tasks from multiple applications and/or users. To address these challenges, we develop the Medusa system to simplify parallel graph processing on the GPU and to support high-throughput execution of concurrent GPU tasks. This thesis presents the design, implementation and experimental evaluation of Medusa, followed by detailed case studies of Medusa in real-world graph applications. To democratize GPU-accelerated graph processing, Medusa provides a programming framework that enables users to harness the power of GPUs by writing sequential C code. Specifically, Medusa offers a small set of APIs for developers to define their application logic, and a runtime system that automatically executes the user-defined functions in parallel on GPUs. We further develop a series of graph-centric optimizations, based on the architectural features of GPUs, to improve the efficiency of Medusa.
With the system optimized on a single GPU, we further extend it to execute on multiple GPUs within a machine and develop optimizations for computation and data transfer. With the improved programmability of Medusa, users can implement their graph processing algorithms and submit multiple tasks to the GPU for concurrent execution, which requires an optimized runtime system for managing concurrent GPU tasks. In practice, GPUs are usually deployed in shared environments such as clusters and clouds, where many tasks are submitted to GPUs by different users and throughput is an important metric for both performance and total cost of ownership. We therefore propose Kernelet as the runtime back-end system to optimize the throughput of concurrent kernel executions on the GPU. Kernelet combines transparent memory and PCIe data transfer management with dynamic slicing and scheduling techniques to support efficient concurrent kernel execution. With slicing, Kernelet divides a GPU kernel into multiple sub-kernels (called slices); each slice has tunable occupancy, allowing it to be co-scheduled with other slices for high GPU utilization. We develop a novel Markov chain-based performance model to guide the scheduling decisions. Our experimental results demonstrate up to 31% and 23% performance improvement for our benchmark kernels on NVIDIA Tesla C2050 and GTX680 GPUs, respectively, and Kernelet improves the throughput of concurrent Medusa applications by 29% on the C2050. As a case study, we collaborate with our colleagues to adopt Medusa in a real research project: GPU-accelerated simulation of information propagation over large-scale, complex social networks. For a number of applications, Medusa serves as the base infrastructure on which our colleagues develop additional application-specific optimizations for further performance improvement.
From this case study, we demonstrate that the programmability of Medusa improves coding productivity for domain-specific applications, and that Medusa delivers significant performance speedups by leveraging GPUs.
author2 He Bingsheng
author_facet He Bingsheng
Zhong, Jianlong
format Theses and Dissertations
author Zhong, Jianlong
author_sort Zhong, Jianlong
title Parallel graph processing on graphics processing units
title_short Parallel graph processing on graphics processing units
title_full Parallel graph processing on graphics processing units
title_fullStr Parallel graph processing on graphics processing units
title_full_unstemmed Parallel graph processing on graphics processing units
title_sort parallel graph processing on graphics processing units
publishDate 2014
url https://hdl.handle.net/10356/59975
_version_ 1759853038868103168
spelling sg-ntu-dr.10356-599752023-03-04T00:42:07Z Parallel graph processing on graphics processing units Zhong, Jianlong He Bingsheng School of Computer Engineering DRNTU::Engineering::Computer science and engineering DOCTOR OF PHILOSOPHY (SCE) 2014-05-21T05:59:11Z 2014-05-21T05:59:11Z 2014 2014 Thesis Zhong, J. (2014). Parallel graph processing on graphics processing units. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/59975 10.32657/10356/59975 en 142 p. application/pdf