Content and link based web spam detection
Main Authors:
Format: text
Language: English
Published: Animo Repository, 2012
Online Access: https://animorepository.dlsu.edu.ph/etd_bachelors/14785
Institution: De La Salle University
Summary: Web spam pages are web pages that use various maneuvering techniques to artificially raise their rankings in search engine results. These pages illegitimately manipulate the algorithms used by search engines, allowing them to appear as though they contain trustworthy content and are most relevant to what the search engine user needs. Consequently, this degrades the quality of search engine results, and search engine users are inevitably misled. Human experts can do a good job of identifying spam pages and pages whose content is of doubtful quality, but it is impractical to rely solely on human effort to classify millions of web pages, as doing so is too costly and time consuming. Most recently developed approaches to this problem use machine learning to detect web spam: a set of expert-classified pages, labelled either reputable or spam, serves as input to a learning algorithm, which then learns to classify other, unclassified pages on the web. While researchers in this field are mainly concerned with identifying new feature sets, the costs of retrieving these feature sets are often disregarded. This study identified the C4.5 classifier, paired with a feature set containing more content-based than link-based features of a page, as the most efficient web spam detection design in terms of minimizing the required resource utilization, specifically the time complexity, while maintaining the quality of web spam detection.
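The summary describes training a C4.5 classifier on expert-labelled pages using mostly content-based features. As a rough illustration of the splitting criterion at the heart of such a tree learner, the sketch below computes entropy and information gain over a toy set of labelled pages. All feature names (`word_count`, `title_words`) and data values are hypothetical, and C4.5 proper refines plain information gain into a gain ratio; this only shows the underlying entropy-based idea, not the thesis's actual pipeline.

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(rows, labels, feature, threshold):
    """Information gain from splitting numeric `feature` at `threshold` --
    the quantity a C4.5-style learner maximizes at each tree node."""
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    if not left or not right:
        return 0.0  # degenerate split carries no information
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# Hypothetical expert-classified pages with content-based features only.
pages = [
    {"word_count": 5200, "title_words": 48},  # keyword-stuffed title
    {"word_count": 430,  "title_words": 6},   # ordinary page
    {"word_count": 6100, "title_words": 52},  # keyword-stuffed title
    {"word_count": 780,  "title_words": 9},   # ordinary page
]
labels = ["spam", "reputable", "spam", "reputable"]

# On this toy data, splitting on title length separates the classes perfectly,
# so the gain equals the full 1 bit of label entropy.
gain = info_gain(pages, labels, "title_words", 20)
print(round(gain, 3))  # -> 1.0
```

In a real detector each page would contribute dozens of content-based features (title length, visible-text fraction, word frequencies) and link-based features (in-degree, PageRank-style scores), and the tree learner would pick the best feature-threshold pair at every node; the study's point is that content-based features tend to give comparable detection quality at lower retrieval cost.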