Implementation of web product search engine : parallel incremental web crawler

One of the main objectives in designing a Parallel Incremental Web Crawler is to provide a solution to the problem of designing a large scale web-based Content Based Image Retrieval (CBIR) system. Our CBIR system has indexed more than 1 million images crawled from various Business to Consumer (B2C)...

Bibliographic Details
Main Author:	Lwi, Tiong Chai.
Other Authors:	Hoi Chu Hong
Format:	Final Year Project
Language:	English
Published:	2011
Subjects:	DRNTU::Engineering::Computer science and engineering::Information systems::Information storage and retrieval
Online Access:	http://hdl.handle.net/10356/46343
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-46343
record_format	dspace
spelling	sg-ntu-dr.10356-46343 2023-03-03T20:52:38Z Implementation of web product search engine : parallel incremental web crawler Lwi, Tiong Chai. Hoi Chu Hong School of Computer Engineering Centre for Advanced Information Systems DRNTU::Engineering::Computer science and engineering::Information systems::Information storage and retrieval One of the main objectives in designing a Parallel Incremental Web Crawler is to provide a solution to the problem of designing a large-scale web-based Content-Based Image Retrieval (CBIR) system. Our CBIR system has indexed more than 1 million images crawled from various Business to Consumer (B2C) websites to date. Internet traffic today is growing more complicated, and analyzing how websites are interlinked and how similar their content is has become important for Web Mining. Because the web is growing and dynamic, traversing and handling all the URLs in web documents poses unprecedented scaling challenges, so it has become imperative to parallelize the crawling process to extract useful data from the web. In this report, we propose a novel architecture for a parallel crawler with an optimization model that is scalable and resilient against system crashes, maximizing the download rate while minimizing the parallelization overhead, based on API and domain-specific crawling. We also discuss how our crawling module is realized to make the crawling task more effective and scalable during data collection, without recursively crawling the same honey pot. We further discuss the storage of extracted data using certain data management techniques, as well as image processing techniques such as spatial anti-aliasing and enhancement that the crawler applies when an image is processed and stored. Finally, several experiments were conducted to evaluate the quality of the processed data and the parallel performance of the algorithms in the web crawler. Several benchmarking tests were also conducted to evaluate CPU resource utilization and the freshness of the eVISE operational database. Bachelor of Engineering (Computer Science) 2011-12-02T04:35:05Z 2011-12-02T04:35:05Z 2011 2011 Final Year Project (FYP) http://hdl.handle.net/10356/46343 en Nanyang Technological University 103 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Computer science and engineering::Information systems::Information storage and retrieval
spellingShingle	DRNTU::Engineering::Computer science and engineering::Information systems::Information storage and retrieval Lwi, Tiong Chai. Implementation of web product search engine : parallel incremental web crawler
description	One of the main objectives in designing a Parallel Incremental Web Crawler is to provide a solution to the problem of designing a large-scale web-based Content-Based Image Retrieval (CBIR) system. Our CBIR system has indexed more than 1 million images crawled from various Business to Consumer (B2C) websites to date. Internet traffic today is growing more complicated, and analyzing how websites are interlinked and how similar their content is has become important for Web Mining. Because the web is growing and dynamic, traversing and handling all the URLs in web documents poses unprecedented scaling challenges, so it has become imperative to parallelize the crawling process to extract useful data from the web. In this report, we propose a novel architecture for a parallel crawler with an optimization model that is scalable and resilient against system crashes, maximizing the download rate while minimizing the parallelization overhead, based on API and domain-specific crawling. We also discuss how our crawling module is realized to make the crawling task more effective and scalable during data collection, without recursively crawling the same honey pot. We further discuss the storage of extracted data using certain data management techniques, as well as image processing techniques such as spatial anti-aliasing and enhancement that the crawler applies when an image is processed and stored. Finally, several experiments were conducted to evaluate the quality of the processed data and the parallel performance of the algorithms in the web crawler. Several benchmarking tests were also conducted to evaluate CPU resource utilization and the freshness of the eVISE operational database.
author2	Hoi Chu Hong
author_facet	Hoi Chu Hong Lwi, Tiong Chai.
format	Final Year Project
author	Lwi, Tiong Chai.
author_sort	Lwi, Tiong Chai.
title	Implementation of web product search engine : parallel incremental web crawler
title_sort	implementation of web product search engine : parallel incremental web crawler
publishDate	2011
url	http://hdl.handle.net/10356/46343
_version_	1759853140938588160

Similar Items