Content and link based web spam detection

Web spam pages are web pages that use various manipulative techniques to artificially raise their rankings in search engine results. These pages illegitimately manipulate the algorithms used by search engines, allowing them to appear as though they contain trustworthy content and are most r...

Bibliographic Details
Main Authors: Canete, Arien Kris Jacob, Gervacio, Paolo Miguel, Kim, Dong-Hwan, Quinto, Rafael
Format: text
Language: English
Published: Animo Repository 2012
Online Access: https://animorepository.dlsu.edu.ph/etd_bachelors/14785
Institution: De La Salle University
id oai:animorepository.dlsu.edu.ph:etd_bachelors-15427
record_format eprints
spelling oai:animorepository.dlsu.edu.ph:etd_bachelors-15427 2021-11-23T04:15:01Z Content and link based web spam detection Canete, Arien Kris Jacob; Gervacio, Paolo Miguel; Kim, Dong-Hwan; Quinto, Rafael 2012-01-01T08:00:00Z text https://animorepository.dlsu.edu.ph/etd_bachelors/14785 Bachelor's Theses English Animo Repository
institution De La Salle University
building De La Salle University Library
continent Asia
country Philippines
content_provider De La Salle University Library
collection DLSU Institutional Repository
language English
description Web spam pages are web pages that use various manipulative techniques to artificially raise their rankings in search engine results. These pages illegitimately manipulate the algorithms used by search engines, allowing them to appear as though they contain trustworthy content and are the most relevant to what the search engine user needs. Consequently, the quality of search engine results degrades and search engine users are inevitably misled. Human experts can do a good job of identifying spam pages and pages whose content is of doubtful quality. However, it is impractical to rely solely on human effort to classify millions of web pages, since it is too costly and time-consuming. Most recently developed approaches to this problem use machine learning to detect web spam: a set of expert-classified pages, each labeled either reputable or spam, is given as input to one or more algorithms, which learn from it and then classify other, unclassified pages on the web. While researchers in this field are mainly concerned with identifying new feature sets, the resources required for retrieving these feature sets are disregarded. This study identified a C4.5 classifier with a feature set containing more content-based features than link-based features of a page as the most efficient web spam detection design in terms of minimizing the required resource utilization, specifically time complexity, while maintaining the quality of web spam detection.
format text
author Canete, Arien Kris Jacob
Gervacio, Paolo Miguel
Kim, Dong-Hwan
Quinto, Rafael
title Content and link based web spam detection
publisher Animo Repository
publishDate 2012
url https://animorepository.dlsu.edu.ph/etd_bachelors/14785
_version_ 1718383366780223488