
Bibliographic Details
Main Author: SYAMSU, IQBAL (NIM 23205037)
Format: Thesis
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/10683
Institution: Institut Teknologi Bandung
Description
Summary: As the web grows and the need to handle ever larger numbers of web documents increases, a high-performance web crawler is required. A single web crawler is practically incapable of meeting this need, whereas a high-performance crawler makes parallel processing possible. The approach taken here is therefore to build a parallel, distributed crawling system, which allows a large number of web pages to be handled in a shorter period of time.

This thesis describes the design of a distributed crawler for a web search engine. The design focuses on issues such as overlap and communication overhead, and on how their effects can be minimized with a coordinated system. The design consists of four crawler processes, which were tested as an intra-site parallel crawler network using a breadth-first strategy in exchange mode. Analysis was performed on a 1.2 GB data sample produced by the crawler, using queries against a database coordinator.

With distributed parallel processing, crawler performance increases; however, adding more processes is not always directly proportional to the performance gained. In addition, modeling with exchange mode yields a smaller overlap value, (N - I)/I, while increasing the coverage value.
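The abstract names the key design elements (four crawler processes, breadth-first ordering, exchange mode) without giving code, so the following Python sketch is only an illustration of how those pieces can fit together, not the thesis's implementation: URLs are hash-partitioned across processes, each process crawls its own partition breadth-first, and links that belong to another partition are exchanged rather than fetched locally. All identifiers (Crawler, partition_of, fetch, extract_links) are hypothetical.

```python
# Illustrative sketch only -- the thesis publishes no code, so the names
# Crawler, partition_of, fetch and extract_links are all hypothetical.
from collections import deque
from hashlib import md5
from urllib.parse import urlparse

NUM_CRAWLERS = 4  # the design under test used four crawler processes

def partition_of(url: str) -> int:
    """Assign every URL to exactly one crawler by hashing its host name."""
    host = urlparse(url).netloc
    return int(md5(host.encode()).hexdigest(), 16) % NUM_CRAWLERS

class Crawler:
    def __init__(self, my_id, outboxes):
        self.my_id = my_id
        self.frontier = deque()   # FIFO queue gives breadth-first order
        self.seen = set()         # URLs already queued or fetched locally
        self.outboxes = outboxes  # one message queue per crawler process

    def enqueue(self, url):
        owner = partition_of(url)
        if owner == self.my_id:
            if url not in self.seen:
                self.seen.add(url)
                self.frontier.append(url)
        else:
            # Exchange mode: instead of fetching a foreign URL (which
            # would create overlap), hand it to the process that owns it.
            self.outboxes[owner].append(url)

    def step(self, fetch, extract_links):
        """Fetch one page from the frontier and enqueue its out-links."""
        if not self.frontier:
            return
        url = self.frontier.popleft()
        page = fetch(url)                 # download the page
        for link in extract_links(page):  # parse out-links
            self.enqueue(link)
```

Partitioning by host name keeps all pages of a site with a single process, which is what makes the overlap of exchange mode small: no URL is ever downloaded by two processes, at the cost of the inter-process communication the abstract calls communication overhead.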
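For the overlap figure mentioned above, a small worked example may help. The abstract gives only the formula (N - I)/I; assuming, as is standard for this metric, that N is the total number of page downloads across all processes and I is the number of distinct pages obtained (the sample figures below are invented for illustration), overlap measures the fraction of wasted duplicate downloads:

```python
# Hedged example: N and I follow the standard reading of the metric,
# and the numbers below are purely illustrative, not thesis results.
def overlap(n_total: int, i_unique: int) -> float:
    """Overlap metric (N - I) / I: fraction of duplicate downloads."""
    return (n_total - i_unique) / i_unique

print(overlap(10_000, 10_000))  # 0.0 -> no page fetched twice
print(overlap(12_000, 10_000))  # 0.2 -> 20% of unique pages re-fetched
```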