A dataset for web page classification

The internet is becoming larger every day, giving people public access to millions of web pages. Unfortunately, in this vast domain of web pages, the presence of malicious web pages is inevitable. Thus, the detection of these malicious web pages can contribute to the security and protection of inter...

Bibliographic Details
Main Authors:	Asoy, Julian Y., Domingo, Kenneth Vincent G.
Format:	text
Language:	English
Published:	Animo Repository 2016
Subjects:	Web sites > Security measures Computer Sciences
Online Access:	https://animorepository.dlsu.edu.ph/etd_bachelors/14967
Institution:	De La Salle University

id	oai:animorepository.dlsu.edu.ph:etd_bachelors-6208
record_format	eprints
spelling	oai:animorepository.dlsu.edu.ph:etd_bachelors-6208 2021-05-10T08:42:24Z A dataset for web page classification Asoy, Julian Y. Domingo, Kenneth Vincent G. The internet is becoming larger every day, giving people public access to millions of web pages. Unfortunately, in this vast domain of web pages, the presence of malicious web pages is inevitable. Thus, the detection of these malicious web pages can contribute to the security and protection of internet users. There have been various studies on malicious web page detection. In these studies, a variety of significant web page features are extracted from collected URLs and web pages to characterize malicious and benign web pages. In addition, different machine learning techniques were used together with the features to detect malicious web pages. This study focuses on the data collection and feature extraction processes involved in malicious web page detection by creating a usable dataset of URLs, web page contents, and features. The dataset contains 58,654 usable web pages, of which 61% are benign and 39% are malicious. Each web page has 24 features: 15 of these have been used in previous studies, while the other 9 are experimental features based on HTML5. The dataset was used in 3 tests, each with a different set of features, in order to evaluate their effectiveness. A total of 6 machine learning techniques, classified as parametric (naive Bayes (NB) and logistic regression (LR)) and non-parametric (decision tree (DT), support vector machine (SVM), random forest (RF), and K-nearest neighbor (KNN)), are used to test the usefulness of the datasets. Results show that non-parametric classifiers performed better in terms of kappa and accuracy. However, because the dataset is unbalanced, classifiers are evaluated based on their precision and recall. Overall, the SVM yielded the highest precision and recall values, at 81.92% and 73.4% respectively. 2016-01-01T08:00:00Z text https://animorepository.dlsu.edu.ph/etd_bachelors/14967 Bachelor's Theses English Animo Repository Web sites--Security measures Computer Sciences
institution	De La Salle University
building	De La Salle University Library
continent	Asia
country	Philippines
content_provider	De La Salle University Library
collection	DLSU Institutional Repository
language	English
topic	Web sites--Security measures Computer Sciences
spellingShingle	Web sites--Security measures Computer Sciences Asoy, Julian Y. Domingo, Kenneth Vincent G. A dataset for web page classification
description	The internet is becoming larger every day, giving people public access to millions of web pages. Unfortunately, in this vast domain of web pages, the presence of malicious web pages is inevitable. Thus, the detection of these malicious web pages can contribute to the security and protection of internet users. There have been various studies on malicious web page detection. In these studies, a variety of significant web page features are extracted from collected URLs and web pages to characterize malicious and benign web pages. In addition, different machine learning techniques were used together with the features to detect malicious web pages. This study focuses on the data collection and feature extraction processes involved in malicious web page detection by creating a usable dataset of URLs, web page contents, and features. The dataset contains 58,654 usable web pages, of which 61% are benign and 39% are malicious. Each web page has 24 features: 15 of these have been used in previous studies, while the other 9 are experimental features based on HTML5. The dataset was used in 3 tests, each with a different set of features, in order to evaluate their effectiveness. A total of 6 machine learning techniques, classified as parametric (naive Bayes (NB) and logistic regression (LR)) and non-parametric (decision tree (DT), support vector machine (SVM), random forest (RF), and K-nearest neighbor (KNN)), are used to test the usefulness of the datasets. Results show that non-parametric classifiers performed better in terms of kappa and accuracy. However, because the dataset is unbalanced, classifiers are evaluated based on their precision and recall. Overall, the SVM yielded the highest precision and recall values, at 81.92% and 73.4% respectively.
format	text
author	Asoy, Julian Y. Domingo, Kenneth Vincent G.
author_facet	Asoy, Julian Y. Domingo, Kenneth Vincent G.
author_sort	Asoy, Julian Y.
title	A dataset for web page classification
title_short	A dataset for web page classification
title_full	A dataset for web page classification
title_fullStr	A dataset for web page classification
title_full_unstemmed	A dataset for web page classification
title_sort	dataset for web page classification
publisher	Animo Repository
publishDate	2016
url	https://animorepository.dlsu.edu.ph/etd_bachelors/14967
_version_	1718382543309373440
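
The description above evaluates classifiers by accuracy, kappa, precision, and recall, and notes that precision and recall matter more when the classes are unbalanced. The following is a minimal sketch of how these four measures can be computed from a binary confusion matrix. The class name `ConfusionMatrix` and all counts are hypothetical and chosen only to mirror a 61%/39% benign/malicious split; they are not taken from the thesis, and the thesis may have computed its metrics with other tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion matrix; 'positive' is assumed to mean malicious."""
    tp: int  # malicious pages flagged as malicious
    fp: int  # benign pages flagged as malicious
    fn: int  # malicious pages that were missed
    tn: int  # benign pages passed as benign

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    def accuracy(self) -> float:
        return (self.tp + self.tn) / self.total

    def precision(self) -> float:
        flagged = self.tp + self.fp
        # Return 0.0 when nothing was flagged, instead of dividing by zero.
        return self.tp / flagged if flagged else 0.0

    def recall(self) -> float:
        actual = self.tp + self.fn
        return self.tp / actual if actual else 0.0

    def kappa(self) -> float:
        """Cohen's kappa: agreement beyond what chance alone would give."""
        n = self.total
        observed = self.accuracy()
        # Chance agreement from the predicted and actual class marginals.
        p_pos = ((self.tp + self.fp) / n) * ((self.tp + self.fn) / n)
        p_neg = ((self.tn + self.fn) / n) * ((self.tn + self.fp) / n)
        expected = p_pos + p_neg
        return (observed - expected) / (1 - expected) if expected < 1 else 0.0


# Hypothetical counts for 100 pages (39 malicious, 61 benign).
model = ConfusionMatrix(tp=30, fp=10, fn=9, tn=51)

# A degenerate classifier that labels every page as benign.
always_benign = ConfusionMatrix(tp=0, fp=0, fn=39, tn=61)

for name, cm in [("model", model), ("always_benign", always_benign)]:
    print(
        f"{name}: accuracy={cm.accuracy():.3f} kappa={cm.kappa():.3f} "
        f"precision={cm.precision():.3f} recall={cm.recall():.3f}"
    )
```

In this sketch the always-benign classifier still reaches 61% accuracy while its recall and kappa are both zero, which illustrates why accuracy alone can mislead on an unbalanced dataset.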


Similar Items