Efficient and scalable techniques for pagerank-based graph analytics

Graphs are ubiquitous today and are a fundamental data structure to represent objects and their relations in various domains, e.g., social science, citation analysis, weblink analysis, and biological informatics. PageRank-based techniques such as personalized PageRank, heat kernel PageRank, TrustRan...

Full description

Saved in:
Bibliographic Details
Main Author: Yang, Renchi
Other Authors: Sourav S Bhowmick
Format: Thesis-Doctor of Philosophy
Language:English
Published: Nanyang Technological University 2020
Subjects:
Online Access:https://hdl.handle.net/10356/145185
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-145185
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Information systems::Database management
spellingShingle Engineering::Computer science and engineering::Information systems::Database management
Yang, Renchi
Efficient and scalable techniques for pagerank-based graph analytics
description Graphs are ubiquitous today and are a fundamental data structure to represent objects and their relations in various domains, e.g., social science, citation analysis, weblink analysis, and biological informatics. PageRank-based techniques such as personalized PageRank, heat kernel PageRank, TrustRank have been well studied and shown great efficacy in graph processing tasks including web ranking, recommender system, graph clustering and combating web spam. In this thesis, we investigate three important problems in graph processing by exploiting PageRank-based techniques, namely, local graph clustering, homogeneous network embedding and attributed network embedding. These three problems are not only interesting in themselves when the graph size becomes large, but also find numerous applications in both academia and industry. First, we study the local graph clustering on undirected graphs. Given an undirected graph G and a seed node s, the local clustering problem aims to identify a high-quality cluster containing s in time roughly proportional to the size of the cluster, regardless of the size of G. Recently, heat kernel PageRank (HKPR) is applied to this problem and found to be more efficient compared with prior methods. However, existing solutions for computing HKPR either are prohibitively expensive or provide unsatisfactory error approximation on HKPR values, rendering them impractical especially on billion-edge graphs. Thus, we present TEA and TEA+, which utilize deterministic graph traversal to produce a rough estimation of the exact HKPR vector, and then exploit Monte-Carlo random walks to refine the results in an optimized and non-trivial way. In particular, TEA+ offers practical efficiency and effectiveness due to non-trivial optimizations. Second, we investigate the homogeneous network embedding (HNE) problem. Given an input graph G and a node v ∈ G, HNE aims to map the graph structure in the vicinity of v to a fixed-dimensional feature vector. Existing approaches to HNE are either immensely expensive, and, thus, are limited to small graphs or fail to fully capture the local graph structure, leading to limited effectiveness of the extracted feature vectors. Meanwhile, in recent years there have been rapid advancements in scalable algorithms for computing approximate personalized PageRank (PPR), which captures rich graph topological information. However, PPR was designed for a very different purpose, i.e., ranking nodes in G based on their relative importance from a source node’s perspective. In contrast, HNE aims to build node embeddings considering the whole graph. Consequently, node embeddings derived directly from PPR are of suboptimal utility. Motivated by this, we propose Node-Reweighted PageRank (NRP), a novel solution that transforms PPR values into effective node embeddings, by iteratively solving an optimization problem. Finally, we consider the attributed network embedding (ANE) problem. Given a graph G where each node is associated with a set of attributes, ANE maps each node v in G to a compact vector Xv, which can be used in downstream machine learning tasks. Existing solutions largely fail on such graphs, leading to prohibitive costs, low-quality embeddings, or both. We propose PANE, an effective and scalable approach to ANE computation for massive graphs. PANE obtains high scalability and effectiveness through three main algorithmic designs. First, it formulates the learning objective based on a novel node attribute PageRank model for attributed networks. The resulting optimization task is still challenging on large graphs. Second, PANE includes a highly efficient solver for the above optimization problem, whose key module is a carefully designed initialization of the embeddings, which drastically reduces the number of iterations required to converge. Finally, PANE utilizes multi-core CPUs through non-trivial parallelization of the above solver, which achieves scalability while retaining the high quality of the resulting embeddings.
author2 Sourav S Bhowmick
author_facet Sourav S Bhowmick
Yang, Renchi
format Thesis-Doctor of Philosophy
author Yang, Renchi
author_sort Yang, Renchi
title Efficient and scalable techniques for pagerank-based graph analytics
title_short Efficient and scalable techniques for pagerank-based graph analytics
title_full Efficient and scalable techniques for pagerank-based graph analytics
title_fullStr Efficient and scalable techniques for pagerank-based graph analytics
title_full_unstemmed Efficient and scalable techniques for pagerank-based graph analytics
title_sort efficient and scalable techniques for pagerank-based graph analytics
publisher Nanyang Technological University
publishDate 2020
url https://hdl.handle.net/10356/145185
_version_ 1695706159356313600
spelling sg-ntu-dr.10356-1451852021-03-02T08:37:26Z Efficient and scalable techniques for pagerank-based graph analytics Yang, Renchi Sourav S Bhowmick School of Computer Science and Engineering ASSourav@ntu.edu.sg Engineering::Computer science and engineering::Information systems::Database management Graphs are ubiquitous today and are a fundamental data structure to represent objects and their relations in various domains, e.g., social science, citation analysis, weblink analysis, and biological informatics. PageRank-based techniques such as personalized PageRank, heat kernel PageRank, TrustRank have been well studied and shown great efficacy in graph processing tasks including web ranking, recommender system, graph clustering and combating web spam. In this thesis, we investigate three important problems in graph processing by exploiting PageRank-based techniques, namely, local graph clustering, homogeneous network embedding and attributed network embedding. These three problems are not only interesting in themselves when the graph size becomes large, but also find numerous applications in both academia and industry. First, we study the local graph clustering on undirected graphs. Given an undirected graph G and a seed node s, the local clustering problem aims to identify a high-quality cluster containing s in time roughly proportional to the size of the cluster, regardless of the size of G. Recently, heat kernel PageRank (HKPR) is applied to this problem and found to be more efficient compared with prior methods. However, existing solutions for computing HKPR either are prohibitively expensive or provide unsatisfactory error approximation on HKPR values, rendering them impractical especially on billion-edge graphs. Thus, we present TEA and TEA+, which utilize deterministic graph traversal to produce a rough estimation of the exact HKPR vector, and then exploit Monte-Carlo random walks to refine the results in an optimized and non-trivial way. In particular, TEA+ offers practical efficiency and effectiveness due to non-trivial optimizations. Second, we investigate the homogeneous network embedding (HNE) problem. Given an input graph G and a node v ∈ G, HNE aims to map the graph structure in the vicinity of v to a fixed-dimensional feature vector. Existing approaches to HNE are either immensely expensive, and, thus, are limited to small graphs or fail to fully capture the local graph structure, leading to limited effectiveness of the extracted feature vectors. Meanwhile, in recent years there have been rapid advancements in scalable algorithms for computing approximate personalized PageRank (PPR), which captures rich graph topological information. However, PPR was designed for a very different purpose, i.e., ranking nodes in G based on their relative importance from a source node’s perspective. In contrast, HNE aims to build node embeddings considering the whole graph. Consequently, node embeddings derived directly from PPR are of suboptimal utility. Motivated by this, we propose Node-Reweighted PageRank (NRP), a novel solution that transforms PPR values into effective node embeddings, by iteratively solving an optimization problem. Finally, we consider the attributed network embedding (ANE) problem. Given a graph G where each node is associated with a set of attributes, ANE maps each node v in G to a compact vector Xv, which can be used in downstream machine learning tasks. Existing solutions largely fail on such graphs, leading to prohibitive costs, low-quality embeddings, or both. We propose PANE, an effective and scalable approach to ANE computation for massive graphs. PANE obtains high scalability and effectiveness through three main algorithmic designs. First, it formulates the learning objective based on a novel node attribute PageRank model for attributed networks. The resulting optimization task is still challenging on large graphs. Second, PANE includes a highly efficient solver for the above optimization problem, whose key module is a carefully designed initialization of the embeddings, which drastically reduces the number of iterations required to converge. Finally, PANE utilizes multi-core CPUs through non-trivial parallelization of the above solver, which achieves scalability while retaining the high quality of the resulting embeddings. Doctor of Philosophy 2020-12-15T01:33:17Z 2020-12-15T01:33:17Z 2020 Thesis-Doctor of Philosophy Yang, R. (2020). Efficient and scalable techniques for pagerank-based graph analytics. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/145185 10.32657/10356/145185 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University