Building Standard Offline Anti-phishing Dataset for Benchmarking

Anti-phishing research is an active field in information security. Because no publicly accessible standard test dataset exists, most researchers use their own datasets for experiments. This makes benchmarking across different anti-phishing techniques cha...

Bibliographic Details
Main Authors: Chiew, Kang Leng, Chang, Ee Hung, Tan, Choon Lin, Abdullah, Johari, Yong, Kelvin Sheng Chek
Format: Article
Language: English
Published: Science Publishing Corporation 2018
Subjects:
Online Access: http://ir.unimas.my/id/eprint/22983/1/Building%20Standard%20Offline%20Anti-phishing%20Dataset%20for%20....%20-%20Copy.pdf
http://ir.unimas.my/id/eprint/22983/
https://www.sciencepubco.com/index.php/ijet
Institution: Universiti Malaysia Sarawak
id my.unimas.ir.22983
record_format eprints
spelling my.unimas.ir.22983 2022-09-29T02:37:54Z http://ir.unimas.my/id/eprint/22983/
citation Chiew, Kang Leng and Chang, Ee Hung and Tan, Choon Lin and Abdullah, Johari and Yong, Kelvin Sheng Chek (2018) Building Standard Offline Anti-phishing Dataset for Benchmarking. International Journal of Engineering & Technology, 7 (4.31). pp. 7-14. ISSN 2227-524X (Article, PeerReviewed) https://www.sciencepubco.com/index.php/ijet
institution Universiti Malaysia Sarawak
building Centre for Academic Information Services (CAIS)
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaysia Sarawak
content_source UNIMAS Institutional Repository
url_provider http://ir.unimas.my/
language English
topic Q Science (General)
QA75 Electronic computers. Computer science
description Anti-phishing research is an active field in information security. Because no publicly accessible standard test dataset exists, most researchers use their own datasets for experiments. This makes benchmarking across different anti-phishing techniques challenging and inefficient. In this paper, we propose and construct a large-scale standard offline dataset that is downloadable, universal and comprehensive. In designing the dataset creation approach, we thoroughly considered the major anti-phishing techniques in the literature to identify their unique requirements. This requirements study identified several factors that influence dataset quality: the type of raw elements, the source of the samples, the sample size, the website category, the category distribution, the language of the website and the support for feature extraction. These factors are the core of the proposed dataset construction approach, which produced a collection of 30,000 samples of phishing and legitimate webpages, with 50 percent of each type. The dataset is therefore compatible with a wide range of anti-phishing research for benchmarking, and is also useful for researchers conducting rapid proof-of-concept experiments. With anti-phishing research developing rapidly to counter the fast evolution of phishing attacks, the need for such a dataset cannot be overemphasised. The complete dataset is available for download at http://www.fcsit.unimas.my/research/legit-phish-set.
format Article
author Chiew, Kang Leng
Chang, Ee Hung
Tan, Choon Lin
Abdullah, Johari
Yong, Kelvin Sheng Chek
title Building Standard Offline Anti-phishing Dataset for Benchmarking
publisher Science Publishing Corporation
publishDate 2018
url http://ir.unimas.my/id/eprint/22983/1/Building%20Standard%20Offline%20Anti-phishing%20Dataset%20for%20....%20-%20Copy.pdf
http://ir.unimas.my/id/eprint/22983/
https://www.sciencepubco.com/index.php/ijet
_version_ 1745566046672125952