An Automated Algorithm for Extracting Website Skeleton
The huge amount of information available on the Web has attracted many research efforts into developing wrappers that extract data from webpages. However, as most of the systems for generating wrappers focus on extracting data at page-level, data extraction at site-level remains a manual or semi-aut...
Main Authors: | LIU, Zehua; NG, Wee-Keong; LIM, Ee Peng |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2004 |
Subjects: | Databases and Information Systems; Numerical Analysis and Scientific Computing |
Online Access: | https://ink.library.smu.edu.sg/sis_research/1040 https://ink.library.smu.edu.sg/context/sis_research/article/2039/viewcontent/10.1.1.10.9158.pdf |
Institution: | Singapore Management University |
id |
sg-smu-ink.sis_research-2039 |
---|---|
record_format |
dspace |
spelling |
sg-smu-ink.sis_research-2039; 2018-06-13T04:04:13Z; 2004-03-01T08:00:00Z; text; application/pdf; https://ink.library.smu.edu.sg/sis_research/1040; info:doi/10.1007/978-3-540-24571-1_70; https://ink.library.smu.edu.sg/context/sis_research/article/2039/viewcontent/10.1.1.10.9158.pdf; http://creativecommons.org/licenses/by-nc-nd/4.0/; Research Collection School Of Computing and Information Systems; eng; Institutional Knowledge at Singapore Management University; Databases and Information Systems; Numerical Analysis and Scientific Computing |
institution |
Singapore Management University |
building |
SMU Libraries |
continent |
Asia |
country |
Singapore |
content_provider |
SMU Libraries |
collection |
InK@SMU |
language |
English |
topic |
Databases and Information Systems Numerical Analysis and Scientific Computing |
description |
The huge amount of information available on the Web has attracted many research efforts into developing wrappers that extract data from webpages. However, as most of the systems for generating wrappers focus on extracting data at page-level, data extraction at site-level remains a manual or semi-automatic process. In this paper, we study the problem of extracting a website skeleton, i.e. extracting the underlying hyperlink structure that is used to organize the content pages in a given website. We propose an automated algorithm, called the Sew algorithm, to discover the skeleton of a website. Given a page, the algorithm examines hyperlinks in groups and identifies the navigation links that point to pages in the next level in the website structure. The entire skeleton is then constructed by recursively fetching the pages pointed to by the discovered links and analyzing these pages using the same process. Our experiments on real-life websites show that the algorithm achieves a high recall with moderate precision. |
format |
text |
author |
LIU, Zehua NG, Wee-Keong LIM, Ee Peng |
author_sort |
LIU, Zehua |
title |
An Automated Algorithm for Extracting Website Skeleton |
title_sort |
automated algorithm for extracting website skeleton |
publisher |
Institutional Knowledge at Singapore Management University |
publishDate |
2004 |
url |
https://ink.library.smu.edu.sg/sis_research/1040 https://ink.library.smu.edu.sg/context/sis_research/article/2039/viewcontent/10.1.1.10.9158.pdf |
_version_ |
1770570832531161088 |
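
The description field above outlines the Sew algorithm only at a high level: examine a page's hyperlinks in groups, pick the navigation links that lead to the next level, then recurse into the pages they point to. The Python sketch below illustrates that recursive shape under loud assumptions. It is not the authors' implementation: the selection rule in `pick_navigation_group` (take the group with the most unvisited links) is a hypothetical placeholder, since this record does not state the actual criteria, and `TOY_SITE`, `fetch_toy`, and `build_skeleton` are illustrative names that operate on an in-memory site rather than fetched pages.

```python
from typing import Callable, Dict, List, Optional, Set

# A page is modelled as a list of hyperlink groups; each group is a list of URLs.
LinkGroups = List[List[str]]


def pick_navigation_group(groups: LinkGroups, visited: Set[str]) -> List[str]:
    """Hypothetical placeholder rule: return the unvisited links of the group
    that has the most unvisited links. The real Sew criteria are not given here."""
    best: List[str] = []
    for group in groups:
        # Drop already-visited URLs and in-group duplicates, keeping order.
        fresh = list(dict.fromkeys(u for u in group if u not in visited))
        if len(fresh) > len(best):
            best = fresh
    return best


def build_skeleton(
    url: str,
    fetch_groups: Callable[[str], LinkGroups],
    visited: Optional[Set[str]] = None,
    max_depth: int = 5,
) -> Dict:
    """Recursively build a tree of pages by following the chosen link group."""
    if visited is None:
        visited = {url}
    node: Dict = {"url": url, "children": []}
    if max_depth == 0:
        return node
    nav_links = pick_navigation_group(fetch_groups(url), visited)
    # Mark all chosen links as visited before recursing so siblings do not
    # re-discover one another as children.
    visited.update(nav_links)
    for link in nav_links:
        node["children"].append(
            build_skeleton(link, fetch_groups, visited, max_depth - 1)
        )
    return node


# Toy in-memory site (assumption, for illustration only).
TOY_SITE: Dict[str, LinkGroups] = {
    "/": [["/news", "/sports"], ["/about"]],
    "/news": [["/", "/news", "/sports"], ["/news/world", "/news/local"]],
    "/sports": [["/", "/news"], ["/sports/tennis"]],
}


def fetch_toy(url: str) -> LinkGroups:
    """Stand-in for fetching a page and grouping its hyperlinks."""
    return TOY_SITE.get(url, [])


if __name__ == "__main__":
    import json

    print(json.dumps(build_skeleton("/", fetch_toy), indent=2))
```

The shared `visited` set and the `max_depth` cap are design choices of this sketch that keep the recursion finite on sites whose pages link back to their parents; a real crawler would also need page fetching, link grouping, and error handling, none of which are covered here.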