Efficient techniques for subgraph mining and query processing

Graph data has been so prevalent that efficiently obtaining useful information from them is highly demanded. Given massive amounts of graph data, people are often interested in a small portion, namely their subgraphs, by the processes of mining and querying. Due to the enormous number of subgraphs i...

Full description

Saved in:
Bibliographic Details
Main Author: Lin, Wenqing
Other Authors: Xiao Xiaokui
Format: Theses and Dissertations
Language:English
Published: 2015
Subjects:
Online Access:https://hdl.handle.net/10356/62137
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-62137
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Information systems::Database management
spellingShingle DRNTU::Engineering::Computer science and engineering::Information systems::Database management
Lin, Wenqing
Efficient techniques for subgraph mining and query processing
description Graph data has been so prevalent that efficiently obtaining useful information from them is highly demanded. Given massive amounts of graph data, people are often interested in a small portion, namely their subgraphs, by the processes of mining and querying. Due to the enormous number of subgraphs in the massive graph data, these processes are highly costly. In this thesis, we study three important problems on subgraph mining and query processing, i.e., frequent subgraph mining, network motif discovery, and generalized subgraph query processing. These problems find numerous applications in real world, whereas they are extremely challenging.% since subgraph isomorphism testing is NP-hard. First, mining frequent subgraphs from a large collection of graph objects is an important problem in several application domains such as bio-informatics, social networks, computer vision, etc. The main challenge in subgraph mining is efficiency, as (i) testing for graph isomorphisms is computationally intensive, and (ii) the cardinality of the graph collection to be mined may be very large. We propose a two-step {\em filter-and-refinement} approach that is suitable to massive parallelization within the scalable MapReduce computing model. We partition the collection of graphs among worker nodes, and each worker applies the filter step to determine a set of candidate subgraphs that are {\em locally frequent} in its partition. The union of all such graphs is the input to the refinement step, where each candidate is checked against all partitions and only the {\em globally frequent} graphs are retained. We devise a statistical threshold mechanism that allows us to predict which subgraphs have a high chance to become globally frequent, and thus reduce the computational overhead in the refinement step. We also propose effective strategies to avoid redundant computation in each round when searching for candidate graphs, as well as a lightweight graph compression mechanism to reduce the communication cost between machines. Extensive experimental evaluation results on several real-world large graph datasets show that the proposed approach clearly outperforms the existing state-of-the-art and provides a practical solution to the problem of frequent subgraph mining for massive collections of graphs. Second, the identification of network motifs has essential applications in numerous domains, such as pattern detection in biological networks and graph analysis in digital circuits. However, mining network motifs is computationally challenging, as it requires to enumerate subgraphs from a real-life graph, and compute the frequency of each subgraph in a large number of random graphs. In particular, existing solutions often require days to derive network motifs from biological networks with only a few thousand vertices. To address this problem, this thesis presents a novel study on network motif discovery using Graphical Processing Units (GPUs). The basic idea is to employ GPUs to parallelize a large number of subgraph matching tasks in computing subgraph frequencies from random graphs, so as to reduce the overall computation time of network motif discovery. We explore the design space of GPU-based subgraph matching algorithms, with careful analysis of several crucial factors that affect the performance of GPU programs. Based on our analysis, we develop a GPU-based solution that (i) drastically differs from existing CPU-based methods, and (ii) exploits the strengths of GPUs in terms of parallelism while mitigating their limitations in terms of the computation power per GPU core. With extensive experiments on a variety of biological networks, we show that our solution is up to two orders of magnitude faster than the best CPU-based approach, and is around $20$ times more cost-effective than the latter, when taking into account the monetary costs of the CPU and GPUs used. Finally, we study a new type of graph queries, which injectively maps its edges to paths of the graphs in a given database, where the length of each path is constrained by a given threshold specified by the weight of the corresponding matching edge. We give important applications of the new graph query and identify new challenges of processing such a query. In particular, the new type of graph queries further explodes the already exponential search space compared with conventional graph queries. Besides, existing pruning techniques cannot be directly applied or they are simply not adequate, since our new type of query graphs is different from their indexed features. To address these issues, we devise the cost model of the branch-and-bound algorithm framework for processing the graph query, and propose an efficient algorithm to minimize the cost overhead. We also develop three indexing techniques to efficiently answer the queries online, each of which can be applied into different scenarios. Finally, we verify the efficiency of our proposed indexes with extensive experiments on large real and synthetic datasets.
author2 Xiao Xiaokui
author_facet Xiao Xiaokui
Lin, Wenqing
format Theses and Dissertations
author Lin, Wenqing
author_sort Lin, Wenqing
title Efficient techniques for subgraph mining and query processing
title_short Efficient techniques for subgraph mining and query processing
title_full Efficient techniques for subgraph mining and query processing
title_fullStr Efficient techniques for subgraph mining and query processing
title_full_unstemmed Efficient techniques for subgraph mining and query processing
title_sort efficient techniques for subgraph mining and query processing
publishDate 2015
url https://hdl.handle.net/10356/62137
_version_ 1759857668713873408
spelling sg-ntu-dr.10356-621372023-03-04T00:42:26Z Efficient techniques for subgraph mining and query processing Lin, Wenqing Xiao Xiaokui School of Computer Engineering Centre for Advanced Information Systems DRNTU::Engineering::Computer science and engineering::Information systems::Database management Graph data has been so prevalent that efficiently obtaining useful information from them is highly demanded. Given massive amounts of graph data, people are often interested in a small portion, namely their subgraphs, by the processes of mining and querying. Due to the enormous number of subgraphs in the massive graph data, these processes are highly costly. In this thesis, we study three important problems on subgraph mining and query processing, i.e., frequent subgraph mining, network motif discovery, and generalized subgraph query processing. These problems find numerous applications in real world, whereas they are extremely challenging.% since subgraph isomorphism testing is NP-hard. First, mining frequent subgraphs from a large collection of graph objects is an important problem in several application domains such as bio-informatics, social networks, computer vision, etc. The main challenge in subgraph mining is efficiency, as (i) testing for graph isomorphisms is computationally intensive, and (ii) the cardinality of the graph collection to be mined may be very large. We propose a two-step {\em filter-and-refinement} approach that is suitable to massive parallelization within the scalable MapReduce computing model. We partition the collection of graphs among worker nodes, and each worker applies the filter step to determine a set of candidate subgraphs that are {\em locally frequent} in its partition. The union of all such graphs is the input to the refinement step, where each candidate is checked against all partitions and only the {\em globally frequent} graphs are retained. We devise a statistical threshold mechanism that allows us to predict which subgraphs have a high chance to become globally frequent, and thus reduce the computational overhead in the refinement step. We also propose effective strategies to avoid redundant computation in each round when searching for candidate graphs, as well as a lightweight graph compression mechanism to reduce the communication cost between machines. Extensive experimental evaluation results on several real-world large graph datasets show that the proposed approach clearly outperforms the existing state-of-the-art and provides a practical solution to the problem of frequent subgraph mining for massive collections of graphs. Second, the identification of network motifs has essential applications in numerous domains, such as pattern detection in biological networks and graph analysis in digital circuits. However, mining network motifs is computationally challenging, as it requires to enumerate subgraphs from a real-life graph, and compute the frequency of each subgraph in a large number of random graphs. In particular, existing solutions often require days to derive network motifs from biological networks with only a few thousand vertices. To address this problem, this thesis presents a novel study on network motif discovery using Graphical Processing Units (GPUs). The basic idea is to employ GPUs to parallelize a large number of subgraph matching tasks in computing subgraph frequencies from random graphs, so as to reduce the overall computation time of network motif discovery. We explore the design space of GPU-based subgraph matching algorithms, with careful analysis of several crucial factors that affect the performance of GPU programs. Based on our analysis, we develop a GPU-based solution that (i) drastically differs from existing CPU-based methods, and (ii) exploits the strengths of GPUs in terms of parallelism while mitigating their limitations in terms of the computation power per GPU core. With extensive experiments on a variety of biological networks, we show that our solution is up to two orders of magnitude faster than the best CPU-based approach, and is around $20$ times more cost-effective than the latter, when taking into account the monetary costs of the CPU and GPUs used. Finally, we study a new type of graph queries, which injectively maps its edges to paths of the graphs in a given database, where the length of each path is constrained by a given threshold specified by the weight of the corresponding matching edge. We give important applications of the new graph query and identify new challenges of processing such a query. In particular, the new type of graph queries further explodes the already exponential search space compared with conventional graph queries. Besides, existing pruning techniques cannot be directly applied or they are simply not adequate, since our new type of query graphs is different from their indexed features. To address these issues, we devise the cost model of the branch-and-bound algorithm framework for processing the graph query, and propose an efficient algorithm to minimize the cost overhead. We also develop three indexing techniques to efficiently answer the queries online, each of which can be applied into different scenarios. Finally, we verify the efficiency of our proposed indexes with extensive experiments on large real and synthetic datasets. DOCTOR OF PHILOSOPHY (SCE) 2015-01-21T07:58:55Z 2015-01-21T07:58:55Z 2015 2015 Thesis Lin, W. (2015). Efficient techniques for subgraph mining and query processing. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/62137 10.32657/10356/62137 en 115 p. application/pdf