DISTRIBUTED CRAWLING ON ONLINE SHOP WEBSITE
Nowadays, online shops are growing very fast. There are many websites that provide a place for anyone who wants to have an online shop. The increasing number of online shops is currently a problem for Badan Pusat Statistik (Statistics of Indonesia) which is responsible for data collection of all bus...
Main Author: | Inayati, Nur’izzah (NIM: 23216038) |
---|---|
Format: | Theses |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/29798 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:29798 |
---|---|
spelling |
id-itb.:29798 2018-10-01T10:02:43Z DISTRIBUTED CRAWLING ON ONLINE SHOP WEBSITE Inayati - NIM: 23216038 , Nur’izzah Indonesia Theses INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/29798 text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Online shops are growing very fast, and many websites now provide a place for anyone who wants to run one. The increasing number of online shops is currently a problem for Badan Pusat Statistik (Statistics of Indonesia), which is responsible for collecting data on all business activities in Indonesia, because it is difficult to obtain information about online businesses run by respondents and household members. Web crawling and web scraping are two ways to extract data from web pages. Because online shop sites use dynamic pages, simple web crawlers cannot retrieve data from those pages.

This research proposes a mechanism for crawling web pages with dynamic data that runs in a distributed manner. The data extracted is the data of each shop account at two online shop sites. To extract data automatically, an automated extraction mechanism is designed using semantic analysis. To speed up the crawling process, a distributed crawling mechanism is designed using Apache Spark. A prototype was built to test the design, and experiments were run on the prototype to determine the performance of the proposed distributed crawling. The experimental results show that automated extraction using semantic analysis achieves 100 percent precision and 94.94 percent recall. Distributed crawling speeds up the crawling process and simplifies scaling: to increase the capacity for extracted data, it is enough to add resources in the form of a node, without changing the application. |
format |
Theses |
author |
Inayati, Nur’izzah (NIM: 23216038) |
title |
DISTRIBUTED CRAWLING ON ONLINE SHOP WEBSITE |
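
The abstract reports the extraction quality as precision and recall. As a reminder of how those two figures are usually computed, here is a minimal Python sketch. It is not taken from the thesis: the function name `precision_recall` and the field names in the example are hypothetical, and the toy data is illustrative only, not the thesis's experimental data.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for a set of extracted items.

    precision = correct extracted items / all extracted items
    recall    = correct extracted items / all items that should be extracted
    """
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical shop-account fields (assumed names, for illustration only).
expected = {"shop_name", "location", "join_date", "product_count"}
extracted = {"shop_name", "location", "join_date"}

precision, recall = precision_recall(extracted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every extracted field is correct, so precision is 1.0, while one expected field is missed, so recall falls below 1.0. The pattern reported in the abstract (100 percent precision, 94.94 percent recall) has the same shape: nothing wrong was extracted, but some items were missed.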