Content and link based web spam detection
Web spam refers to web pages that use various manipulation techniques to artificially raise their rankings in search engine results. These pages illegitimately manipulate the algorithms used by search engines so that they appear to contain trustworthy content and are most r...
Main Authors: | Canete, Arien Kris Jacob; Gervacio, Paolo Miguel; Kim, Dong-Hwan; Quinto, Rafael |
---|---|
Format: | text |
Language: | English |
Published: | Animo Repository, 2012 |
Online Access: | https://animorepository.dlsu.edu.ph/etd_bachelors/14785 |
Institution: | De La Salle University |
id |
oai:animorepository.dlsu.edu.ph:etd_bachelors-15427 |
---|---|
record_format |
eprints |
spelling |
oai:animorepository.dlsu.edu.ph:etd_bachelors-15427 2021-11-23T04:15:01Z Content and link based web spam detection Canete, Arien Kris Jacob; Gervacio, Paolo Miguel; Kim, Dong-Hwan; Quinto, Rafael Web spam refers to web pages that use various manipulation techniques to artificially raise their rankings in search engine results. These pages illegitimately manipulate the algorithms used by search engines so that they appear to contain trustworthy content that is most relevant to what the search engine user needs. Consequently, the quality of search engine results degrades and search engine users are inevitably misled. Human experts can do a good job of identifying spam pages and pages whose content is of doubtful quality. However, it is impractical to rely solely on human effort to classify millions of web pages, since doing so is too costly and time consuming. Most recently developed approaches to this problem use machine learning to detect web spam: a set of expert-classified pages (either reputable or spam) is given as input to one or more algorithms, which learn from it and then classify other, unclassified pages on the web. While researchers in this field are mainly concerned with identifying new feature sets, the resources required to retrieve these feature sets are disregarded. This study has identified a C4.5 classifier with a feature set containing more content-based features than link-based features of a page as the most efficient web spam detection design in terms of minimizing the required resource utilization, specifically the time complexity, while maintaining the quality of web spam detection. 2012-01-01T08:00:00Z text https://animorepository.dlsu.edu.ph/etd_bachelors/14785 Bachelor's Theses English Animo Repository |
institution |
De La Salle University |
building |
De La Salle University Library |
continent |
Asia |
country |
Philippines |
content_provider |
De La Salle University Library |
collection |
DLSU Institutional Repository |
language |
English |
description |
Web spam refers to web pages that use various manipulation techniques to artificially raise their rankings in search engine results. These pages illegitimately manipulate the algorithms used by search engines so that they appear to contain trustworthy content that is most relevant to what the search engine user needs. Consequently, the quality of search engine results degrades and search engine users are inevitably misled. Human experts can do a good job of identifying spam pages and pages whose content is of doubtful quality. However, it is impractical to rely solely on human effort to classify millions of web pages, since doing so is too costly and time consuming. Most recently developed approaches to this problem use machine learning to detect web spam: a set of expert-classified pages (either reputable or spam) is given as input to one or more algorithms, which learn from it and then classify other, unclassified pages on the web. While researchers in this field are mainly concerned with identifying new feature sets, the resources required to retrieve these feature sets are disregarded. This study has identified a C4.5 classifier with a feature set containing more content-based features than link-based features of a page as the most efficient web spam detection design in terms of minimizing the required resource utilization, specifically the time complexity, while maintaining the quality of web spam detection. |
format |
text |
author |
Canete, Arien Kris Jacob; Gervacio, Paolo Miguel; Kim, Dong-Hwan; Quinto, Rafael |
spellingShingle |
Canete, Arien Kris Jacob; Gervacio, Paolo Miguel; Kim, Dong-Hwan; Quinto, Rafael. Content and link based web spam detection |
author_facet |
Canete, Arien Kris Jacob; Gervacio, Paolo Miguel; Kim, Dong-Hwan; Quinto, Rafael |
author_sort |
Canete, Arien Kris Jacob |
title |
Content and link based web spam detection |
title_short |
Content and link based web spam detection |
title_full |
Content and link based web spam detection |
title_fullStr |
Content and link based web spam detection |
title_full_unstemmed |
Content and link based web spam detection |
title_sort |
content and link based web spam detection |
publisher |
Animo Repository |
publishDate |
2012 |
url |
https://animorepository.dlsu.edu.ph/etd_bachelors/14785 |
_version_ |
1718383366780223488 |
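
The abstract describes training a C4.5 classifier on expert-labelled pages using content-based and link-based features. As a rough illustration only, the sketch below computes the gain-ratio criterion that C4.5 uses to choose a split. It is not the thesis authors' implementation: the feature names (`keyword_fraction`, `in_degree`), the toy data, and the helper functions are all hypothetical, and full C4.5 also handles pruning, missing values, and multi-way splits that are omitted here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` on one numeric feature."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    info_gain = entropy(labels) - (weights[0] * entropy(left) + weights[1] * entropy(right))
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


def best_split(rows, labels, feature_names):
    """Return (score, feature name, threshold) with the highest gain ratio."""
    best = (0.0, None, None)
    for i, name in enumerate(feature_names):
        values = [row[i] for row in rows]
        distinct = sorted(set(values))
        # Simplification: candidate thresholds are midpoints between neighbouring values.
        for low, high in zip(distinct, distinct[1:]):
            threshold = (low + high) / 2
            score = gain_ratio(values, labels, threshold)
            if score > best[0]:
                best = (score, name, threshold)
    return best


# Hypothetical toy data: one content-based and one link-based feature per page.
feature_names = ["keyword_fraction", "in_degree"]
rows = [(0.05, 40), (0.10, 3), (0.08, 25), (0.45, 30), (0.50, 2), (0.60, 35)]
labels = ["reputable", "reputable", "reputable", "spam", "spam", "spam"]

score, feature, threshold = best_split(rows, labels, feature_names)
print(f"best split: {feature} <= {threshold:.3f} (gain ratio {score:.3f})")
```

In this made-up data the content-based feature separates the two classes cleanly while the link-based one does not, so the root split falls on the content feature. That outcome is a property of the toy data, not a result from the thesis.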