Term frequency-information content for focused crawling to predict relevant web pages.

With the rapid growth of the Web, finding desirable information on the Internet is a tedious and time-consuming task. Focused crawlers are the golden keys to solving this issue through mining of Web content. In this regard, a variety of methods have been devised and implemented. Many of these meth...

Bibliographic Details
Main Authors: Pesaranghader, Ali, Mustapha, Norwati
Format: Article
Language: English
Published: Advanced Institute of Convergence Information Technology 2013
Online Access: http://psasir.upm.edu.my/id/eprint/30629/1/Term%20frequency.pdf
http://psasir.upm.edu.my/id/eprint/30629/
Institution: Universiti Putra Malaysia
id my.upm.eprints.30629
record_format eprints
spelling my.upm.eprints.30629
2015-10-28T03:18:09Z
http://psasir.upm.edu.my/id/eprint/30629/
Pesaranghader, Ali and Mustapha, Norwati (2013) Term frequency-information content for focused crawling to predict relevant web pages. International Journal of Digital Content Technology and its Applications, 7 (12). pp. 113-122. ISSN 1975-9339
Advanced Institute of Convergence Information Technology 2013-08
Article PeerReviewed application/pdf en
http://psasir.upm.edu.my/id/eprint/30629/1/Term%20frequency.pdf
institution Universiti Putra Malaysia
building UPM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Putra Malaysia
content_source UPM Institutional Repository
url_provider http://psasir.upm.edu.my/
language English
description With the rapid growth of the Web, finding desirable information on the Internet is a tedious and time-consuming task. Focused crawlers are the golden keys to solving this issue through mining of Web content. In this regard, a variety of methods have been devised and implemented. Many of these methods, which come from an information retrieval viewpoint, are not biased towards the more informative terms in multi-term topics (topics with more than one keyword). In this paper, by considering terms’ information content, we propose the Term Frequency-Information Content (TF-IC) method, which assigns an appropriate weight to each term in a multi-term topic. In our experiments, we compare our method with other methods such as Term Frequency-Inverse Document Frequency (TF-IDF) and Latent Semantic Indexing (LSI). The experimental results show that our method outperforms those two methods by retrieving more relevant pages for multi-term topics.
format Article
author Pesaranghader, Ali
Mustapha, Norwati
title Term frequency-information content for focused crawling to predict relevant web pages.
publisher Advanced Institute of Convergence Information Technology
publishDate 2013
url http://psasir.upm.edu.my/id/eprint/30629/1/Term%20frequency.pdf
http://psasir.upm.edu.my/id/eprint/30629/
_version_ 1643830116562763776