Enumeration of substructure patterns from large graphs
Graph is a ubiquitous data model and can be used to represent complex entities and their relationships, such as bioinformatics and biochemistry networks, traffic networks, communication networks, social networks, etc. This thesis study the problem of enumeration of fundamental substructure...
Saved in:
Main Author: | |
---|---|
Other Authors: | |
Format: | Theses and Dissertations |
Language: | English |
Published: |
2013
|
Subjects: | |
Online Access: | http://hdl.handle.net/10356/54633 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | Nanyang Technological University |
Language: | English |
Summary: | Graph is a ubiquitous data model and can be used to represent complex entities and their relationships,
such as bioinformatics and biochemistry networks, traffic networks, communication
networks, social networks, etc. This thesis study the problem of enumeration of fundamental
substructure patterns in graphs. The enumeration of substructure patterns of graphs plays an
important role in network analysis, since these problems are crucial to the understanding of
complex networks and integral parts of many more advanced techniques of graph analysis.
In recent years, we have seen a dramatic increase in the number, the size, and the variety of
networks. For example, there are over 800 million active users in Facebook and there are over
1 billion webpages in WWW according to Google. How to handle big data becomes a very
challenging problem for graph analysis. Due to the large volume, large graphs cannot fit into
main memory of a single computer any longer. And due to the huge gap between the efficiency
of accessing data from external memory and from main memory, (in terms of bandwidth and
random access time), existing algorithms on enumerating substructures become inefficient or
even infeasible for large graphs.
In this thesis, we study the problem of enumerating substructures of large disk-resident
graph. We make the following major contributions.
-We developed the first I/O-efficient algorithms for listing triangles in massive networks
and applied them in the computation of clustering coefficient, transitivity, triangular connectivity,
etc. For the same dataset, our algorithm outperform existing algorithm by orders
of magnitude.
-We developed a general algorithm framework to guarantee performance in networks with
different characteristics and worked out a non-trivial parallel version of this algorithm.
This algorithm is up to 1,000 times faster than the state-of-the-art algorithms for clique enumeration in disk-resident networks with a single machine and shows near linear scale
up given more machines. |
---|