Machine learning based web page classifier

In recent years, Internet usage has increased tremendously, and the total number of web pages has become enormous. The Internet is used for a wide variety of purposes and is growing rapidly. In 2015, worldwidewebsize estimated that there were around 50 billion web pages on the Intern...

Bibliographic Details
Main Author:	Setiawan, Andri
Other Authors:	Chang Chip Hong
Format:	Final Year Project
Language:	English
Published:	2016
Subjects:	DRNTU::Engineering
Online Access:	http://hdl.handle.net/10356/68086
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-68086
record_format	dspace
spelling	sg-ntu-dr.10356-68086 2023-07-07T15:56:24Z Machine learning based web page classifier Setiawan, Andri Chang Chip Hong School of Electrical and Electronic Engineering DRNTU::Engineering In recent years, Internet usage has increased tremendously, and the total number of web pages has become enormous. The Internet is used for a wide variety of purposes and is growing rapidly. In 2015, worldwidewebsize estimated that there were around 50 billion web pages on the Internet [1]. Web directories such as DMOZ (directory.mozilla.org) and Hotfrog classify web pages into a set of categories to assist Internet users and search engines such as Google. Search engines have been known to use web directories to find and rank web pages for certain keywords. The largest web directory, DMOZ, is human-edited and lists around 4 million web pages [2]. Most web directories hire web experts to classify web pages into categories, an approach that cannot keep pace with the rate at which the Internet is growing. Hence, machine learning and data mining methods have been researched to categorize web pages automatically. In this project, the features used by the classifier are all related to the HTML structure of the web pages: the most common HTML tags, metadata, and images are extracted from the HTML document. The classifiers used are a Neural Network for Pattern Recognition and a Support Vector Machine. Four classes of web pages are chosen for this project: Online Store, Internet Forum, News Article, and Blog Article. The web pages are collected manually through the Google search engine. Furthermore, the final application of this project is intended to classify a web page using its URL as input.
Bachelor of Engineering 2016-05-24T04:52:19Z 2016-05-24T04:52:19Z 2016 Final Year Project (FYP) http://hdl.handle.net/10356/68086 en Nanyang Technological University 53 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering
spellingShingle	DRNTU::Engineering Setiawan, Andri Machine learning based web page classifier
description	In recent years, Internet usage has increased tremendously, and the total number of web pages has become enormous. The Internet is used for a wide variety of purposes and is growing rapidly. In 2015, worldwidewebsize estimated that there were around 50 billion web pages on the Internet [1]. Web directories such as DMOZ (directory.mozilla.org) and Hotfrog classify web pages into a set of categories to assist Internet users and search engines such as Google. Search engines have been known to use web directories to find and rank web pages for certain keywords. The largest web directory, DMOZ, is human-edited and lists around 4 million web pages [2]. Most web directories hire web experts to classify web pages into categories, an approach that cannot keep pace with the rate at which the Internet is growing. Hence, machine learning and data mining methods have been researched to categorize web pages automatically. In this project, the features used by the classifier are all related to the HTML structure of the web pages: the most common HTML tags, metadata, and images are extracted from the HTML document. The classifiers used are a Neural Network for Pattern Recognition and a Support Vector Machine. Four classes of web pages are chosen for this project: Online Store, Internet Forum, News Article, and Blog Article. The web pages are collected manually through the Google search engine. Furthermore, the final application of this project is intended to classify a web page using its URL as input.
author2	Chang Chip Hong
author_facet	Chang Chip Hong Setiawan, Andri
format	Final Year Project
author	Setiawan, Andri
author_sort	Setiawan, Andri
title	Machine learning based web page classifier
title_short	Machine learning based web page classifier
title_full	Machine learning based web page classifier
title_fullStr	Machine learning based web page classifier
title_full_unstemmed	Machine learning based web page classifier
title_sort	machine learning based web page classifier
publishDate	2016
url	http://hdl.handle.net/10356/68086
_version_	1772828146692784128
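
The description above mentions features derived from the HTML structure of a page (common tags, metadata, images). As a rough illustration only, and not the project's actual implementation, the sketch below counts a few start tags with Python's standard html.parser module and returns them as a fixed-length feature vector. The tag list TRACKED_TAGS, the class TagCounter, and the function html_features are hypothetical names chosen for this example; the record does not say which tags or tools the author used.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag list: the record does not specify which tags were used.
TRACKED_TAGS = ("a", "div", "p", "img", "meta", "form", "input", "li")


class TagCounter(HTMLParser):
    """Counts how often each start tag appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        self.counts[tag] += 1


def html_features(html):
    """Return one count per tracked tag, in TRACKED_TAGS order."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return [parser.counts[tag] for tag in TRACKED_TAGS]


# A tiny made-up page, used only to exercise the function.
sample = (
    '<html><head><meta name="description" content="shop"></head>'
    '<body><div><p>Item</p><img src="a.png"><img src="b.png">'
    '<a href="/cart">Cart</a></div></body></html>'
)
features = html_features(sample)
```

A vector like this could then be passed, together with a class label such as Online Store or News Article, to whichever classifier is chosen; fetching the page from a URL and training the classifier are left out of the sketch.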
