DISTRIBUTED CRAWLING ON ONLINE SHOP WEBSITE

Bibliographic Details
Main Author: Inayati - NIM: 23216038, Nur’izzah
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/29798
Institution: Institut Teknologi Bandung
id id-itb.:29798
spelling id-itb.:29798 2018-10-01T10:02:43Z
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
description Online shops are growing rapidly, and many websites now provide a platform for anyone who wants to open one. The growing number of online shops poses a problem for Badan Pusat Statistik (BPS, Statistics Indonesia), the agency responsible for collecting data on all business activities in Indonesia, because information about online businesses run by respondents and household members is difficult to obtain. Web crawling and web scraping are two ways to extract data from web pages. Because online shop sites use dynamic pages, however, simple web crawlers cannot retrieve data from them.

This research proposes a mechanism for crawling web pages with dynamic data that runs in a distributed manner. The extracted data is the data of each shop account on two online shop sites. To extract data automatically, an automated extraction mechanism is designed using semantic analysis. To speed up the crawling process, a distributed crawling mechanism is designed using Apache Spark. A prototype was built to test the design, and experiments with the prototype measured the performance of the proposed distributed crawling. The experimental results show that automated extraction using semantic analysis achieves 100 percent precision and 94.94 percent recall. Distributed crawling speeds up the crawling process and simplifies scaling: to increase the capacity of extracted data, it is enough to add resources in the form of nodes, without changing the application.
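
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how precision and recall are conventionally computed for an extraction task. It is not taken from the thesis: the function name `precision_recall` and the counts used are hypothetical and chosen only for illustration, and the record does not state how the thesis counted correct and missed fields.

```python
def precision_recall(true_positives: int, false_positives: int,
                     false_negatives: int) -> tuple[float, float]:
    """Return (precision, recall) for one extraction run.

    precision = correctly extracted items / all extracted items
    recall    = correctly extracted items / all items that should be extracted
    """
    extracted = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Hypothetical counts (NOT the thesis data): 47 fields extracted correctly,
# none extracted wrongly, 3 fields missed.
precision, recall = precision_recall(true_positives=47,
                                     false_positives=0,
                                     false_negatives=3)
print(f"precision={precision:.2%}, recall={recall:.2%}")
```

Under these definitions, 100 percent precision means that nothing was extracted incorrectly, while a recall below 100 percent means that some of the items that should have been extracted were missed.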