Efficient and scalable techniques for pagerank-based graph analytics

Graphs are ubiquitous today and are a fundamental data structure to represent objects and their relations in various domains, e.g., social science, citation analysis, weblink analysis, and biological informatics. PageRank-based techniques such as personalized PageRank, heat kernel PageRank, TrustRan...

Full description

Saved in:

Bibliographic Details
Main Author:	Yang, Renchi
Other Authors:	Sourav S Bhowmick
Format:	Thesis-Doctor of Philosophy
Language:	English
Published:	Nanyang Technological University 2020
Subjects:	Engineering::Computer science and engineering::Information systems::Database management
Online Access:	https://hdl.handle.net/10356/145185
Tags:	Add Tag No Tags, Be the first to tag this record!
Institution:	Nanyang Technological University
Language:	English

id	sg-ntu-dr.10356-145185
record_format	dspace
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Computer science and engineering::Information systems::Database management
spellingShingle	Engineering::Computer science and engineering::Information systems::Database management Yang, Renchi Efficient and scalable techniques for pagerank-based graph analytics
description	Graphs are ubiquitous today and are a fundamental data structure to represent objects and their relations in various domains, e.g., social science, citation analysis, weblink analysis, and biological informatics. PageRank-based techniques such as personalized PageRank, heat kernel PageRank, TrustRank have been well studied and shown great eﬃcacy in graph processing tasks including web ranking, recommender system, graph clustering and combating web spam. In this thesis, we investigate three important problems in graph processing by exploiting PageRank-based techniques, namely, local graph clustering, homogeneous network embedding and attributed network embedding. These three problems are not only interesting in themselves when the graph size becomes large, but also ﬁnd numerous applications in both academia and industry. First, we study the local graph clustering on undirected graphs. Given an undirected graph G and a seed node s, the local clustering problem aims to identify a high-quality cluster containing s in time roughly proportional to the size of the cluster, regardless of the size of G. Recently, heat kernel PageRank (HKPR) is applied to this problem and found to be more eﬃcient compared with prior methods. However, existing solutions for computing HKPR either are prohibitively expensive or provide unsatisfactory error approximation on HKPR values, rendering them impractical especially on billion-edge graphs. Thus, we present TEA and TEA+, which utilize deterministic graph traversal to produce a rough estimation of the exact HKPR vector, and then exploit Monte-Carlo random walks to reﬁne the results in an optimized and non-trivial way. In particular, TEA+ oﬀers practical eﬃciency and eﬀectiveness due to non-trivial optimizations. Second, we investigate the homogeneous network embedding (HNE) problem. Given an input graph G and a node v ∈ G, HNE aims to map the graph structure in the vicinity of v to a ﬁxed-dimensional feature vector. Existing approaches to HNE are either immensely expensive, and, thus, are limited to small graphs or fail to fully capture the local graph structure, leading to limited eﬀectiveness of the extracted feature vectors. Meanwhile, in recent years there have been rapid advancements in scalable algorithms for computing approximate personalized PageRank (PPR), which captures rich graph topological information. However, PPR was designed for a very diﬀerent purpose, i.e., ranking nodes in G based on their relative importance from a source node’s perspective. In contrast, HNE aims to build node embeddings considering the whole graph. Consequently, node embeddings derived directly from PPR are of suboptimal utility. Motivated by this, we propose Node-Reweighted PageRank (NRP), a novel solution that transforms PPR values into eﬀective node embeddings, by iteratively solving an optimization problem. Finally, we consider the attributed network embedding (ANE) problem. Given a graph G where each node is associated with a set of attributes, ANE maps each node v in G to a compact vector Xv, which can be used in downstream machine learning tasks. Existing solutions largely fail on such graphs, leading to prohibitive costs, low-quality embeddings, or both. We propose PANE, an eﬀective and scalable approach to ANE computation for massive graphs. PANE obtains high scalability and eﬀectiveness through three main algorithmic designs. First, it formulates the learning objective based on a novel node attribute PageRank model for attributed networks. The resulting optimization task is still challenging on large graphs. Second, PANE includes a highly eﬃcient solver for the above optimization problem, whose key module is a carefully designed initialization of the embeddings, which drastically reduces the number of iterations required to converge. Finally, PANE utilizes multi-core CPUs through non-trivial parallelization of the above solver, which achieves scalability while retaining the high quality of the resulting embeddings.
author2	Sourav S Bhowmick
author_facet	Sourav S Bhowmick Yang, Renchi
format	Thesis-Doctor of Philosophy
author	Yang, Renchi
author_sort	Yang, Renchi
title	Efficient and scalable techniques for pagerank-based graph analytics
title_short	Efficient and scalable techniques for pagerank-based graph analytics
title_full	Efficient and scalable techniques for pagerank-based graph analytics
title_fullStr	Efficient and scalable techniques for pagerank-based graph analytics
title_full_unstemmed	Efficient and scalable techniques for pagerank-based graph analytics
title_sort	efficient and scalable techniques for pagerank-based graph analytics
publisher	Nanyang Technological University
publishDate	2020
url	https://hdl.handle.net/10356/145185
_version_	1695706159356313600
spelling	sg-ntu-dr.10356-1451852021-03-02T08:37:26Z Efficient and scalable techniques for pagerank-based graph analytics Yang, Renchi Sourav S Bhowmick School of Computer Science and Engineering ASSourav@ntu.edu.sg Engineering::Computer science and engineering::Information systems::Database management Graphs are ubiquitous today and are a fundamental data structure to represent objects and their relations in various domains, e.g., social science, citation analysis, weblink analysis, and biological informatics. PageRank-based techniques such as personalized PageRank, heat kernel PageRank, TrustRank have been well studied and shown great eﬃcacy in graph processing tasks including web ranking, recommender system, graph clustering and combating web spam. In this thesis, we investigate three important problems in graph processing by exploiting PageRank-based techniques, namely, local graph clustering, homogeneous network embedding and attributed network embedding. These three problems are not only interesting in themselves when the graph size becomes large, but also ﬁnd numerous applications in both academia and industry. First, we study the local graph clustering on undirected graphs. Given an undirected graph G and a seed node s, the local clustering problem aims to identify a high-quality cluster containing s in time roughly proportional to the size of the cluster, regardless of the size of G. Recently, heat kernel PageRank (HKPR) is applied to this problem and found to be more eﬃcient compared with prior methods. However, existing solutions for computing HKPR either are prohibitively expensive or provide unsatisfactory error approximation on HKPR values, rendering them impractical especially on billion-edge graphs. Thus, we present TEA and TEA+, which utilize deterministic graph traversal to produce a rough estimation of the exact HKPR vector, and then exploit Monte-Carlo random walks to reﬁne the results in an optimized and non-trivial way. In particular, TEA+ oﬀers practical eﬃciency and eﬀectiveness due to non-trivial optimizations. Second, we investigate the homogeneous network embedding (HNE) problem. Given an input graph G and a node v ∈ G, HNE aims to map the graph structure in the vicinity of v to a ﬁxed-dimensional feature vector. Existing approaches to HNE are either immensely expensive, and, thus, are limited to small graphs or fail to fully capture the local graph structure, leading to limited eﬀectiveness of the extracted feature vectors. Meanwhile, in recent years there have been rapid advancements in scalable algorithms for computing approximate personalized PageRank (PPR), which captures rich graph topological information. However, PPR was designed for a very diﬀerent purpose, i.e., ranking nodes in G based on their relative importance from a source node’s perspective. In contrast, HNE aims to build node embeddings considering the whole graph. Consequently, node embeddings derived directly from PPR are of suboptimal utility. Motivated by this, we propose Node-Reweighted PageRank (NRP), a novel solution that transforms PPR values into eﬀective node embeddings, by iteratively solving an optimization problem. Finally, we consider the attributed network embedding (ANE) problem. Given a graph G where each node is associated with a set of attributes, ANE maps each node v in G to a compact vector Xv, which can be used in downstream machine learning tasks. Existing solutions largely fail on such graphs, leading to prohibitive costs, low-quality embeddings, or both. We propose PANE, an eﬀective and scalable approach to ANE computation for massive graphs. PANE obtains high scalability and eﬀectiveness through three main algorithmic designs. First, it formulates the learning objective based on a novel node attribute PageRank model for attributed networks. The resulting optimization task is still challenging on large graphs. Second, PANE includes a highly eﬃcient solver for the above optimization problem, whose key module is a carefully designed initialization of the embeddings, which drastically reduces the number of iterations required to converge. Finally, PANE utilizes multi-core CPUs through non-trivial parallelization of the above solver, which achieves scalability while retaining the high quality of the resulting embeddings. Doctor of Philosophy 2020-12-15T01:33:17Z 2020-12-15T01:33:17Z 2020 Thesis-Doctor of Philosophy Yang, R. (2020). Efficient and scalable techniques for pagerank-based graph analytics. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/145185 10.32657/10356/145185 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University

Efficient and scalable techniques for pagerank-based graph analytics

Similar Items