Efficient and scalable techniques for pagerank-based graph analytics
Graphs are ubiquitous today and are a fundamental data structure to represent objects and their relations in various domains, e.g., social science, citation analysis, weblink analysis, and biological informatics. PageRank-based techniques such as personalized PageRank, heat kernel PageRank, TrustRan...
Saved in:
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: |
Nanyang Technological University
2020
|
Subjects: | |
Online Access: | https://hdl.handle.net/10356/145185 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | Nanyang Technological University |
Language: | English |
id |
sg-ntu-dr.10356-145185 |
---|---|
record_format |
dspace |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Information systems::Database management |
spellingShingle |
Engineering::Computer science and engineering::Information systems::Database management Yang, Renchi Efficient and scalable techniques for pagerank-based graph analytics |
description |
Graphs are ubiquitous today and are a fundamental data structure to represent objects and their relations in various domains, e.g., social science, citation analysis, weblink analysis, and biological informatics. PageRank-based techniques such as personalized PageRank, heat kernel PageRank, TrustRank have been well studied and shown great efficacy in graph processing tasks including web ranking, recommender system, graph clustering and combating web spam. In this thesis, we investigate three important problems in graph processing by exploiting PageRank-based techniques, namely, local graph clustering, homogeneous network embedding and attributed network embedding. These three problems are not only interesting in themselves when the graph size becomes large, but also find numerous applications in both academia and industry.
First, we study the local graph clustering on undirected graphs. Given an undirected graph G and a seed node s, the local clustering problem aims to identify a high-quality cluster containing s in time roughly proportional to the size of the cluster, regardless of the size of G. Recently, heat kernel PageRank (HKPR) is applied to this problem and found to be more efficient compared with prior methods. However, existing solutions for computing HKPR either are prohibitively expensive or provide unsatisfactory error approximation on HKPR values, rendering them impractical especially on billion-edge graphs. Thus, we present TEA and TEA+, which utilize deterministic graph traversal to produce a rough estimation of the exact HKPR vector, and then exploit Monte-Carlo random walks to refine the results in an optimized and non-trivial way. In particular, TEA+ offers practical efficiency and effectiveness due to non-trivial optimizations.
Second, we investigate the homogeneous network embedding (HNE) problem. Given an input graph G and a node v ∈ G, HNE aims to map the graph structure in the vicinity of v to a fixed-dimensional feature vector. Existing approaches to HNE are either immensely expensive, and, thus, are limited to small graphs or fail to fully capture the local graph structure, leading to limited effectiveness of the extracted feature vectors. Meanwhile, in recent years there have been rapid advancements in scalable algorithms for computing approximate personalized PageRank (PPR), which captures rich graph topological information. However, PPR was designed for a very different purpose, i.e., ranking nodes in G based on their relative importance from a source node’s perspective. In contrast, HNE aims to build node embeddings considering the whole graph. Consequently, node embeddings derived directly from PPR are of suboptimal utility. Motivated by this, we propose Node-Reweighted PageRank (NRP), a novel solution that transforms PPR values into effective node embeddings, by iteratively solving an optimization problem.
Finally, we consider the attributed network embedding (ANE) problem. Given a graph G where each node is associated with a set of attributes, ANE maps each node v in G to a compact vector Xv, which can be used in downstream machine learning tasks. Existing solutions largely fail on such graphs, leading to prohibitive costs, low-quality embeddings, or both. We propose PANE, an effective and scalable approach to ANE computation for massive graphs. PANE obtains high scalability and effectiveness through three main algorithmic designs. First, it formulates the learning objective based on a novel node attribute PageRank model for attributed networks. The resulting optimization task is still challenging on large graphs. Second, PANE includes a highly efficient solver for the above optimization problem, whose key module is a carefully designed initialization of the embeddings, which drastically reduces the number of iterations required to converge. Finally, PANE utilizes multi-core CPUs through non-trivial parallelization of the above solver, which achieves scalability while retaining the high quality of the resulting embeddings. |
author2 |
Sourav S Bhowmick |
author_facet |
Sourav S Bhowmick Yang, Renchi |
format |
Thesis-Doctor of Philosophy |
author |
Yang, Renchi |
author_sort |
Yang, Renchi |
title |
Efficient and scalable techniques for pagerank-based graph analytics |
title_short |
Efficient and scalable techniques for pagerank-based graph analytics |
title_full |
Efficient and scalable techniques for pagerank-based graph analytics |
title_fullStr |
Efficient and scalable techniques for pagerank-based graph analytics |
title_full_unstemmed |
Efficient and scalable techniques for pagerank-based graph analytics |
title_sort |
efficient and scalable techniques for pagerank-based graph analytics |
publisher |
Nanyang Technological University |
publishDate |
2020 |
url |
https://hdl.handle.net/10356/145185 |
_version_ |
1695706159356313600 |
spelling |
sg-ntu-dr.10356-1451852021-03-02T08:37:26Z Efficient and scalable techniques for pagerank-based graph analytics Yang, Renchi Sourav S Bhowmick School of Computer Science and Engineering ASSourav@ntu.edu.sg Engineering::Computer science and engineering::Information systems::Database management Graphs are ubiquitous today and are a fundamental data structure to represent objects and their relations in various domains, e.g., social science, citation analysis, weblink analysis, and biological informatics. PageRank-based techniques such as personalized PageRank, heat kernel PageRank, TrustRank have been well studied and shown great efficacy in graph processing tasks including web ranking, recommender system, graph clustering and combating web spam. In this thesis, we investigate three important problems in graph processing by exploiting PageRank-based techniques, namely, local graph clustering, homogeneous network embedding and attributed network embedding. These three problems are not only interesting in themselves when the graph size becomes large, but also find numerous applications in both academia and industry. First, we study the local graph clustering on undirected graphs. Given an undirected graph G and a seed node s, the local clustering problem aims to identify a high-quality cluster containing s in time roughly proportional to the size of the cluster, regardless of the size of G. Recently, heat kernel PageRank (HKPR) is applied to this problem and found to be more efficient compared with prior methods. However, existing solutions for computing HKPR either are prohibitively expensive or provide unsatisfactory error approximation on HKPR values, rendering them impractical especially on billion-edge graphs. Thus, we present TEA and TEA+, which utilize deterministic graph traversal to produce a rough estimation of the exact HKPR vector, and then exploit Monte-Carlo random walks to refine the results in an optimized and non-trivial way. In particular, TEA+ offers practical efficiency and effectiveness due to non-trivial optimizations. Second, we investigate the homogeneous network embedding (HNE) problem. Given an input graph G and a node v ∈ G, HNE aims to map the graph structure in the vicinity of v to a fixed-dimensional feature vector. Existing approaches to HNE are either immensely expensive, and, thus, are limited to small graphs or fail to fully capture the local graph structure, leading to limited effectiveness of the extracted feature vectors. Meanwhile, in recent years there have been rapid advancements in scalable algorithms for computing approximate personalized PageRank (PPR), which captures rich graph topological information. However, PPR was designed for a very different purpose, i.e., ranking nodes in G based on their relative importance from a source node’s perspective. In contrast, HNE aims to build node embeddings considering the whole graph. Consequently, node embeddings derived directly from PPR are of suboptimal utility. Motivated by this, we propose Node-Reweighted PageRank (NRP), a novel solution that transforms PPR values into effective node embeddings, by iteratively solving an optimization problem. Finally, we consider the attributed network embedding (ANE) problem. Given a graph G where each node is associated with a set of attributes, ANE maps each node v in G to a compact vector Xv, which can be used in downstream machine learning tasks. Existing solutions largely fail on such graphs, leading to prohibitive costs, low-quality embeddings, or both. We propose PANE, an effective and scalable approach to ANE computation for massive graphs. PANE obtains high scalability and effectiveness through three main algorithmic designs. First, it formulates the learning objective based on a novel node attribute PageRank model for attributed networks. The resulting optimization task is still challenging on large graphs. Second, PANE includes a highly efficient solver for the above optimization problem, whose key module is a carefully designed initialization of the embeddings, which drastically reduces the number of iterations required to converge. Finally, PANE utilizes multi-core CPUs through non-trivial parallelization of the above solver, which achieves scalability while retaining the high quality of the resulting embeddings. Doctor of Philosophy 2020-12-15T01:33:17Z 2020-12-15T01:33:17Z 2020 Thesis-Doctor of Philosophy Yang, R. (2020). Efficient and scalable techniques for pagerank-based graph analytics. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/145185 10.32657/10356/145185 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University |