Optimized GPU algorithms for sparse data problems

Bibliographic Details
Main Author: Pham, Nguyen Quang Anh
Other Authors: Wen Yonggang
Format: Theses and Dissertations
Language: English
Published: 2017
Online Access:http://hdl.handle.net/10356/72409
Institution: Nanyang Technological University
Description
Summary: Many important problems in science and engineering today deal with sparse data. Examples of sparse data include sparse matrices, where the number of nonzero values is much smaller than the total number possible and where the nonzeros are scattered rather than regularly positioned, and graphs in which the average degree is low and the edge set has an irregular structure. This type of data frequently arises from real-world sources and is used to model physical, biological, or social phenomena. Many sparse datasets are large and require parallel computing to process efficiently, but sparsity leads to a number of performance challenges: irregularities in the size and distribution of the data lead to load imbalance between threads and to scattered memory accesses that strain memory systems optimized for block-based access.

Graphics processing units (GPUs) have been used successfully in recent years to solve many big-data problems. GPUs can execute thousands of threads simultaneously and have much higher throughput and memory bandwidth than CPUs. Nevertheless, the GPU architecture is better suited to processing dense, regular datasets. Sparse data problems such as sparse matrix-vector and matrix-matrix multiplication, breadth-first search, shortest paths, and other graph algorithms achieve much less speedup on GPUs than their dense-data counterparts. The aforementioned problems of load imbalance, high memory latency, and bandwidth saturation are compounded on GPUs by their massively multithreaded SIMD execution model.

In this thesis, we study three fundamental sparse data problems: sparse matrix-vector multiplication (SpMV), sparse matrix-matrix multiplication (SpGEMM), and graph coloring. Each problem is used as a primitive in a number of higher-level applications, so accelerating these problems yields improvements across a broad range of other problems.
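To make the SpMV primitive discussed above concrete, here is a minimal sequential Python sketch of sparse matrix-vector multiplication in CSR (compressed sparse row) format. This is only an illustration of the basic operation; it does not reflect the thesis's GPU algorithm, whose compaction and shared-memory techniques are not shown.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a CSR-format sparse matrix A.

    row_ptr : row i's nonzeros occupy indices row_ptr[i]..row_ptr[i+1]-1
    col_idx : column index of each stored nonzero
    vals    : value of each stored nonzero
    x       : dense input vector
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        # Accumulate the dot product of row i with x.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]
        y[i] = s
    return y

# Example: A = [[1, 0, 2],
#               [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```

The irregularity the abstract describes is visible even here: each row touches a different, scattered subset of `x`, which is what makes the vector accesses hard to cache or coalesce on a GPU.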
For these problems, we present GPU algorithms based on novel techniques that offer best-in-class performance. Our algorithms are based on analyzing the key performance issues and bottlenecks of each problem, and they use both heuristic and theoretically motivated techniques to overcome these limitations. While some of the techniques are problem-specific, others generalize to issues common to many GPU-based sparse data computations.

Our SpMV algorithm compacts a sparse matrix to increase its density and the regularity of its data accesses, and it uses the GPU's fast shared memory to increase the efficiency of repeated SpMV computations. The algorithm reduces I/O for vector accesses by 37% on average and improves performance by up to 35% over the previously fastest GPU SpMV algorithm. Our SpGEMM algorithm efficiently enumerates all the work done during a computation to achieve perfect load balancing, and it uses a randomized algorithm to partition the matrix nearly optimally into pieces small enough to fit in the fast but limited shared memory. It is up to 2.5× faster than the state-of-the-art GPU SpGEMM algorithm on the most difficult, unstructured matrices. Finally, we present two coloring algorithms, optimized respectively for coloring quality and for speed. The first uses a simple counter mechanism to greatly improve overall work efficiency, while the second achieves both high parallelism and relatively high efficiency by randomly coloring the graph based on estimates of its chromatic number. Compared to existing GPU coloring algorithms, our first algorithm uses 1.1–4.3× fewer colors on average, while the second uses slightly more colors but runs 2.7–4.3× faster than other algorithms.
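For readers unfamiliar with the coloring problem, the following is a minimal sequential greedy (first-fit) coloring sketch in Python. It only illustrates what a proper coloring is; the thesis's counter-based and randomized GPU algorithms are different and are not shown here.

```python
def greedy_color(adj):
    """Greedy first-fit coloring.

    adj : dict mapping each vertex to an iterable of its neighbors.
    Returns a dict mapping each vertex to a color (0, 1, 2, ...),
    such that no two adjacent vertices share a color.
    """
    color = {}
    for v in adj:
        # Colors already taken by colored neighbors of v.
        used = {color[u] for u in adj[v] if u in color}
        # Pick the smallest color not used by any neighbor.
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# A triangle (0, 1, 2) plus a pendant vertex 3: needs 3 colors.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
coloring = greedy_color(adj)
print(max(coloring.values()) + 1)  # 3
```

Greedy coloring is inherently sequential because each vertex's color depends on its already-colored neighbors, which is exactly the parallelism obstacle that GPU coloring algorithms such as those in this thesis must work around.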
The techniques we introduced are the basis for our ongoing work on GPU sparse matrix and graph algorithms, as we seek to bridge the gap between the performance of sparse and dense data algorithms on GPUs.