Cohesive subgraph mining in large graphs
Graphs are widely used to model entities and their relationships in different domains and some examples include social networks and biological networks. To analyse these graph data, cohesive subgraph mining is a fundamental approach which has attracted increasing attention from academics and industr...
Main Author: | Yu, Kaiqiang |
---|---|
Other Authors: | Long Cheng |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/171775 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-171775 |
---|---|
record_format |
dspace |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Information systems::Database management; Engineering::Computer science and engineering::Theory of computation::Analysis of algorithms and problem complexity |
spellingShingle |
Engineering::Computer science and engineering::Information systems::Database management Engineering::Computer science and engineering::Theory of computation::Analysis of algorithms and problem complexity Yu, Kaiqiang Cohesive subgraph mining in large graphs |
description |
Graphs are widely used to model entities and their relationships in many domains, such as social networks and biological networks. Cohesive subgraph mining is a fundamental approach to analysing such graph data and has attracted increasing attention from academia and industry over the past decade. In this thesis, we study three cohesive subgraph mining problems on large graphs: (1) the maximal k-biplex enumeration problem on bipartite graphs, (2) the maximum k-biplex search problem on bipartite graphs, and (3) the maximal quasi-clique enumeration problem on large general graphs.
The maximal k-biplex enumeration problem aims to enumerate all maximal k-biplexes (MBPs) of a bipartite graph, where a k-biplex is a subgraph in which each vertex on one side is disconnected from at most k vertices on the other side. Existing methods suffer from efficiency and/or scalability issues, and their delay (the waiting time before the next output) is exponential in the size of the input bipartite graph. We adopt a reverse search framework called bTraversal, which corresponds to a depth-first search (DFS) on an implicit solution graph defined over all MBPs. We then develop a series of techniques for improving and implementing this framework, including (1) carefully selecting an initial solution from which to start the DFS, (2) pruning the vast majority of links from the solution graph of bTraversal, and (3) implementing the abstract procedures of the framework. The resulting algorithm, called iTraversal, has an underlying solution graph that is significantly sparser than that of bTraversal (around 0.1% of the links). In addition, iTraversal guarantees polynomial delay.
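As a concrete reading of the k-biplex definition above, the following is a minimal sketch of a membership check. It is not part of iTraversal or bTraversal; the function name `is_k_biplex` and the adjacency-dictionary representation are assumptions made for illustration only.

```python
from typing import Dict, Hashable, Set


def is_k_biplex(adj: Dict[Hashable, Set[Hashable]],
                left: Set[Hashable],
                right: Set[Hashable],
                k: int) -> bool:
    """Check whether the subgraph induced by (left, right) is a k-biplex.

    adj (hypothetical representation) maps every vertex to the set of its
    neighbours on the opposite side of the bipartite graph.
    """
    # Every left vertex may miss at most k vertices of the chosen right side.
    for u in left:
        if len(right - adj.get(u, set())) > k:
            return False
    # Symmetrically, every right vertex may miss at most k left vertices.
    for v in right:
        if len(left - adj.get(v, set())) > k:
            return False
    return True


# Toy bipartite graph with edges a-x, a-y, b-x (b and y are not connected).
adj = {"a": {"x", "y"}, "b": {"x"}, "x": {"a", "b"}, "y": {"a"}}
one_biplex = is_k_biplex(adj, {"a", "b"}, {"x", "y"}, k=1)
biclique = is_k_biplex(adj, {"a", "b"}, {"x", "y"}, k=0)
```

In the toy graph, vertex b misses y and y misses b, so the pair of sides passes the check for k = 1 but not for k = 0 (the k = 0 case corresponds to a biclique). The check only tests the k-biplex property of a given subgraph; it does not test maximality.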
The maximum k-biplex search problem aims to find the K MBPs with the most edges, called maximum k-biplexes (MaxBPs), on a bipartite graph, where K is a user-specified positive integer. We first formally prove that the problem is NP-hard. We then design two branch-and-bound algorithms; the better one, called FastBB, improves the worst-case time complexity to O*(γ_k^n), where O* suppresses polynomial factors, γ_k is a real number that depends on k and is strictly smaller than 2, and n is the number of vertices in the graph. For example, γ_k = 1.754 for k = 1. We further introduce three techniques for boosting the performance of the branch-and-bound algorithms; the best one, called PBIE, further improves the time complexity to O*(γ_k^(d^3)) for large sparse graphs, where d is the maximum degree of the graph (note that d << n for sparse graphs).
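To give a feel for what a base strictly smaller than 2 means, the sketch below compares only the exponential factors 2^n and γ_k^n, using the value γ_1 = 1.754 quoted above. The helper name `log10_factor_gap` is hypothetical, and the comparison ignores the polynomial factors that O* suppresses, so it says nothing about measured running times.

```python
import math

GAMMA_1 = 1.754  # value of gamma_k for k = 1, as quoted in the abstract


def log10_factor_gap(n: int, base: float = GAMMA_1) -> float:
    """Return log10(2**n / base**n), computed in log space to avoid overflow.

    This is the number of orders of magnitude separating the two
    exponential factors for a graph with n vertices.
    """
    return n * (math.log10(2.0) - math.log10(base))


# Gap between the exponential factors for a few graph sizes.
gaps = {n: log10_factor_gap(n) for n in (50, 100, 200)}
```

The gap grows linearly in n on a log scale, which is why even a modest reduction of the base matters for worst-case bounds.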
The maximal quasi-clique enumeration problem aims to find all maximal γ-quasi-cliques (MQCs) of a general graph, where a γ-quasi-clique (QC) is a subgraph in which each vertex is connected to at least a fraction γ of the other vertices in the subgraph. A common way to find all MQCs is to (1) find a set of QCs that contains all MQCs and then (2) filter out the non-maximal QCs. Quite a few branch-and-bound algorithms have been developed for step (1), but all of them focus on sharpening the pruning techniques and devote little effort to improving the branching part. As a result, they provide no guarantee on pruning branches, and all have a worst-case time complexity of O*(2^n), where O* suppresses polynomial factors and n is the number of vertices in the graph. In this thesis, we also focus on finding a set of QCs that contains all MQCs, but rather than only sharpening the pruning techniques as existing methods do, we consider both the pruning and the branching parts and develop new pruning techniques and branching methods that suit each other better, so that more branches are pruned both theoretically and practically. Specifically, we develop a new branch-and-bound algorithm called FastQC, based on the newly developed pruning techniques and branching methods, which improves the worst-case time complexity to O*(α_k^n), where α_k is a positive real number strictly smaller than 2. Furthermore, we develop a divide-and-conquer strategy for boosting the performance of FastQC.
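The quasi-clique definition and the two-step practice described above can be sketched as follows. This is not FastQC: it only checks the degree condition of a given vertex set and filters a given candidate collection. The function names and the adjacency-dictionary representation are assumptions, and connectivity, which some quasi-clique definitions also require, is not checked here.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, List, Set


def is_quasi_clique(adj: Dict[Hashable, Set[Hashable]],
                    vertices: Iterable[Hashable],
                    gamma: float) -> bool:
    """Each vertex must be adjacent to at least gamma * (|S| - 1) vertices of S."""
    s = set(vertices)
    need = gamma * (len(s) - 1)
    return all(len(adj.get(v, set()) & s) >= need for v in s)


def filter_maximal(candidates: Iterable[Iterable[Hashable]]) -> List[FrozenSet[Hashable]]:
    """Step (2): drop every candidate strictly contained in another candidate.

    This yields exactly the MQCs only if the input already contains all MQCs.
    """
    sets = {frozenset(c) for c in candidates}
    return [c for c in sets if not any(c < other for other in sets)]


# Toy graph: the 4-cycle a-b-c-d-a, where every vertex has 2 of 3 possible neighbours.
adj = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
whole = {"a", "b", "c", "d"}
ok_at_06 = is_quasi_clique(adj, whole, 0.6)   # 2 >= 0.6 * 3
ok_at_07 = is_quasi_clique(adj, whole, 0.7)   # 2 <  0.7 * 3
maximal = filter_maximal([{"a", "b"}, whole, {"c", "d"}])
```

The quadratic subset test in `filter_maximal` is chosen for clarity only; the expensive part of the problem is step (1), which is where the branch-and-bound algorithms discussed above operate.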
We conduct extensive experiments on both real and synthetic datasets, and the results show that our algorithms are orders of magnitude faster than the state-of-the-art algorithms on real datasets. |
author2 |
Long Cheng |
author_facet |
Long Cheng Yu, Kaiqiang |
format |
Thesis-Doctor of Philosophy |
author |
Yu, Kaiqiang |
author_sort |
Yu, Kaiqiang |
title |
Cohesive subgraph mining in large graphs |
title_short |
Cohesive subgraph mining in large graphs |
title_full |
Cohesive subgraph mining in large graphs |
title_fullStr |
Cohesive subgraph mining in large graphs |
title_full_unstemmed |
Cohesive subgraph mining in large graphs |
title_sort |
cohesive subgraph mining in large graphs |
publisher |
Nanyang Technological University |
publishDate |
2023 |
url |
https://hdl.handle.net/10356/171775 |
_version_ |
1784855573357920256 |
spelling |
sg-ntu-dr.10356-171775 2023-12-01T01:52:37Z Cohesive subgraph mining in large graphs Yu, Kaiqiang Long Cheng School of Computer Science and Engineering c.long@ntu.edu.sg Engineering::Computer science and engineering::Information systems::Database management Engineering::Computer science and engineering::Theory of computation::Analysis of algorithms and problem complexity Doctor of Philosophy 2023-11-08T02:44:59Z 2023-11-08T02:44:59Z 2023 Thesis-Doctor of Philosophy Yu, K. (2023). Cohesive subgraph mining in large graphs. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/171775 https://hdl.handle.net/10356/171775 10.32657/10356/171775 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University |