Machine learning based web page classifier
In recent years, Internet usage has increased tremendously, and the total number of web pages has become enormous. The Internet is accessed by everyone for various purposes and is growing very rapidly. In 2015, worldwidewebsize estimated that there were around 50 billion web pages on the Intern...
Main Author: | Setiawan, Andri |
---|---|
Other Authors: | Chang Chip Hong |
Format: | Final Year Project |
Language: | English |
Published: | 2016 |
Subjects: | DRNTU::Engineering |
Online Access: | http://hdl.handle.net/10356/68086 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-68086 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-68086 2023-07-07T15:56:24Z Machine learning based web page classifier Setiawan, Andri Chang Chip Hong School of Electrical and Electronic Engineering DRNTU::Engineering
Bachelor of Engineering 2016-05-24T04:52:19Z 2016-05-24T04:52:19Z 2016 Final Year Project (FYP) http://hdl.handle.net/10356/68086 en Nanyang Technological University 53 p. application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
DRNTU::Engineering |
spellingShingle |
DRNTU::Engineering Setiawan, Andri Machine learning based web page classifier |
description |
In recent years, Internet usage has increased tremendously, and the total number of web pages has become enormous. The Internet is accessed by everyone for various purposes and is growing very rapidly. In 2015, worldwidewebsize estimated that there were around 50 billion web pages on the Internet [1]. Web directories, such as DMOZ (directory.mozilla.org) and Hotfrog, classify web pages into a set of categories to assist Internet users and search engines such as Google. Search engines have been known to use web directories to find and rank web pages for certain keywords. The largest web directory, DMOZ, is human-edited and lists around 4 million web pages [2].
Most web directories hire web experts to classify web pages into categories, an approach that cannot keep pace with the rate at which the Internet is growing. Hence, machine learning and data mining methods have been researched to categorize web pages automatically.
In this project, the features used for the classifier are all related to the HTML structure of the web pages. The most common HTML tags, metadata, and images are extracted from the HTML document. The classifiers used are a Neural Network for Pattern Recognition and a Support Vector Machine. Four classes of web pages are chosen for this project: Online Store, Internet Forum, News Article, and Blog Article. The web pages are collected manually through the Google search engine. Furthermore, the final application for this project should be able to classify a web page given its URL as input. |
author2 |
Chang Chip Hong |
author_facet |
Chang Chip Hong Setiawan, Andri |
format |
Final Year Project |
author |
Setiawan, Andri |
author_sort |
Setiawan, Andri |
title |
Machine learning based web page classifier |
title_short |
Machine learning based web page classifier |
title_full |
Machine learning based web page classifier |
title_fullStr |
Machine learning based web page classifier |
title_full_unstemmed |
Machine learning based web page classifier |
title_sort |
machine learning based web page classifier |
publishDate |
2016 |
url |
http://hdl.handle.net/10356/68086 |
_version_ |
1772828146692784128 |
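
The description above says the classifier features come from the HTML structure of each page (common tags, metadata, images). A minimal sketch of that kind of feature extraction, using only the Python standard library, might look like the following. The tag list `TRACKED_TAGS`, the class `TagCounter`, and the function `html_features` are hypothetical names chosen for illustration; the record does not say which tags or tools the project actually used.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag list: the record does not state which tags were counted.
TRACKED_TAGS = ("a", "div", "p", "img", "meta", "form", "li", "h1")


class TagCounter(HTMLParser):
    """Counts the start tags seen while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.counts[tag] += 1


def html_features(html):
    """Return a fixed-length vector of tag counts in TRACKED_TAGS order."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return [parser.counts[tag] for tag in TRACKED_TAGS]


sample = (
    "<html><head><meta name='description' content='shop'></head>"
    "<body><div><a href='/cart'>Cart</a><img src='item.png'>"
    "<img src='item2.png'></div></body></html>"
)
print(html_features(sample))  # [1, 1, 0, 2, 1, 0, 0, 0]
```

A vector like this could then be passed to a classifier such as a support vector machine or a neural network, with one label per class (Online Store, Internet Forum, News Article, Blog Article). Classifying from a URL would additionally require downloading the page first, which this sketch leaves out.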