PARALLEL FACETNET ALGORITHM WITH VERTEX CUT TECHNIQUE AS A METHOD FOR COMMUNITY EVOLUTION DETECTION ON THE BIG GRAPH
Main Author:
Format: Dissertations
Language: Indonesian
Institution: Institut Teknologi Bandung
Online Access: https://digilib.itb.ac.id/gdl/view/46674
Summary: The community evolution model is one of the most valuable pieces of information for supporting business activities, because it can be used to understand community dynamics. The model can be obtained by detecting the evolution of communities on a graph. Meanwhile, the development of information technology has enabled various organizations to store very large amounts of data and has led to the emergence of large-scale graphs. To date, the methods developed to detect community evolution can only process graphs with a maximum size of around a million vertices. A community evolution detection method for large-scale graphs therefore needs to be developed. Studies of large-graph processing show that the best way to process enormous graphs is to use a distributed system, because a large graph generally requires a correspondingly large amount of memory, and the memory of a single computer is generally insufficient. Based on the preliminary study of this research, the Facetnet algorithm is a community evolution detection algorithm that potentially has linear complexity, which makes it suitable for processing large-scale graphs, and it produces a complete evolutionary model. It is therefore considered very useful to develop the Facetnet algorithm so that it can process very large graphs on a distributed system.
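As an illustration of the factorization idea that Facetnet builds on (not the dissertation's actual implementation), the sketch below computes soft community memberships by a non-negative factorization of the adjacency matrix, with a smoothness penalty tying the current snapshot to the previous one; the function name, the penalty weight `beta`, and the damping constant are all assumptions for this sketch.

```python
import numpy as np

def evolutionary_soft_communities(A_t, H_prev, n_iter=200, beta=0.1, eps=1e-9):
    """Sketch of evolutionary clustering by symmetric non-negative
    factorization: approximately minimizes
        ||A_t - H H^T||_F^2 + beta * ||H - H_prev||_F^2
    with multiplicative updates. Row i of H is the (unnormalized)
    soft community membership of vertex i at the current snapshot."""
    H = H_prev.copy() + 1e-3          # warm-start from the previous snapshot
    for _ in range(n_iter):
        numer = 2.0 * (A_t @ H) + beta * H_prev
        denom = 2.0 * (H @ (H.T @ H)) + beta * H + eps
        # Damped multiplicative update: keeps H non-negative and
        # smooths convergence on small examples.
        H *= 0.5 + 0.5 * (numer / denom)
    return H
```

Normalizing each row of the returned `H` gives per-vertex membership probabilities over the communities; the temporal term is what makes memberships evolve smoothly between snapshots rather than being recomputed from scratch.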
To process a huge graph on a distributed system, several critical components are needed, namely a graph partitioning technique and an algorithm that can run in parallel. The graph partitioning technique divides the large graph into smaller subgraphs, so that each subgraph can be processed in parallel by a computer in the distributed system. This research analyzed graph partitioning techniques and found that Vertex Cut is the best technique to combine with the modified Facetnet algorithm, yielding the Parallel Facetnet algorithm, which can run on a distributed system to build a community evolution model. However, simply dividing large graphs into subgraphs and combining them with the Parallel Facetnet algorithm is not adequate for processing a very large graph, because the membership probability matrix becomes huge. This research also succeeded in finding a graph representation with attributed vertices to overcome this problem. With a graph structure whose vertices carry attributes, the Parallel Facetnet algorithm no longer needs to define or store the adjacency matrix and the membership probability matrix separately. Both matrices are replaced by subgraphs with attributed vertices that can be stored and processed in parallel by the Parallel Facetnet algorithm, using the local memory of each computer in the distributed system. Thus, a community evolution detection method built from these three key elements is able to produce a community evolution model for large-scale graphs. The limited memory capacity of a single computer can be overcome by adding one or more computers to the distributed system, without changing the process logic defined in the Parallel Facetnet algorithm. Remarkably, this method is predicted to be able to process graphs with millions and even up to a billion vertices.
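The Vertex Cut idea can be sketched in a few lines (a simplified single-process illustration, not the dissertation's Spark implementation; the function name and the toy edge-assignment rule are assumptions): edges are assigned whole to partitions, and any vertex whose edges land in several partitions is replicated there, where each replica can then carry the vertex's membership vector as a local attribute instead of a row in one global matrix.

```python
from collections import defaultdict

def vertex_cut_partition(edges, n_parts):
    """Vertex-cut partitioning sketch: each edge goes to exactly one
    partition, and a vertex is replicated on every partition that holds
    at least one of its edges."""
    partitions = defaultdict(list)   # partition id -> edges stored there
    replicas = defaultdict(set)      # vertex -> partition ids holding a replica
    for u, v in edges:
        p = (u + v) % n_parts        # toy deterministic assignment rule
        partitions[p].append((u, v))
        replicas[u].add(p)
        replicas[v].add(p)
    return partitions, replicas
```

The average size of `replicas[v]` is the replication factor of the cut; attaching a per-vertex membership vector to each replica (and synchronizing replicas between iterations) is what removes the need to hold a single global membership matrix in one computer's memory.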
This research also found that Facetnet has a weakness: the algorithm requires the number of communities as an initial guess at the start of the iteration, so it produces community evolution models with the same number of communities for different graphs. This does not match real conditions, in which the number of communities contained in a graph can vary over time. Therefore, this research attempted to develop a method to determine the number of communities automatically, using the Nullity of the Laplacian Adjacency Matrix (NML) of the graph. NML is proven to show the number of communities contained in a graph automatically, although this technique cannot yet be combined with the Parallel Facetnet algorithm, because an algorithm that can compute the NML value in parallel from a partitioned adjacency matrix still needs to be developed.
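The NML idea rests on a standard spectral fact: the nullity (number of zero eigenvalues) of the graph Laplacian L = D - A equals the number of connected components of the graph. A minimal dense, single-machine sketch follows; the dissertation's own procedure may differ, for example in how near-zero eigenvalues are thresholded for loosely connected communities, so the function name and tolerance here are assumptions.

```python
import numpy as np

def nml(A, tol=1e-8):
    """Nullity of the graph Laplacian L = D - A, computed as the count
    of (near-)zero eigenvalues. For a graph whose communities are
    disconnected components, this equals the number of communities."""
    L = np.diag(A.sum(axis=1)) - A   # D - A for a symmetric adjacency A
    eigvals = np.linalg.eigvalsh(L)  # L is symmetric, so eigvalsh applies
    return int(np.sum(np.abs(eigvals) < tol))
```

A dense eigendecomposition is O(n^3), which is exactly why the abstract notes that a parallel algorithm working on a partitioned adjacency matrix is still needed before NML can be combined with Parallel Facetnet.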
This research was carried out in stages, starting from analyzing existing methods, analyzing and designing the proposed method, and conducting experiments to test its capability, until conclusions could be drawn about the success of the study. Experimental results prove that the community evolution detection method, consisting of the Parallel Facetnet algorithm with the Vertex Cut technique and a graph structure with attributed vertices, is capable of producing community evolution models for graphs of up to about two million vertices and 15 million edges, using a distributed system in the form of a Spark cluster of nine computers. This means the Spark cluster can process the graph with at most 32 processors and at most 72 GB of total worker memory, since each computer has four processors and 10 GB of memory. The Parallel Facetnet algorithm is proven to be highly scalable because it has linear time complexity. Experiments on the existing Spark cluster infrastructure also show that the cluster can process graphs whose memory requirements draw on the total memory available across the cluster, as long as the memory required to process each subgraph does not exceed about 55% of the memory on each computer.