Phishing email detection using machine learning

Phishing is a form of Internet fraud that uses socially engineered email messages to deceive users into disclosing sensitive information or clicking on malicious links. This can lead to identity theft, data breaches and financial losses for the victims. As such, it is important for individuals and organisations,...

Bibliographic Details
Main Author:	Goh, Ying Ting
Other Authors:	Yeo Chai Kiat
Format:	Final Year Project
Language:	English
Published:	Nanyang Technological University 2021
Subjects:	Engineering::Computer science and engineering
Online Access:	https://hdl.handle.net/10356/148664
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-148664
record_format	dspace
spelling	sg-ntu-dr.10356-148664 2021-05-15T12:38:50Z Phishing email detection using machine learning Goh, Ying Ting Yeo Chai Kiat School of Computer Science and Engineering Government Technology Agency Singapore Gareth Yeo ASCKYEO@ntu.edu.sg Engineering::Computer science and engineering Bachelor of Engineering (Computer Science) 2021-05-15T12:38:50Z 2021-05-15T12:38:50Z 2021 Final Year Project (FYP) Goh, Y. T. (2021). Phishing email detection using machine learning. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/148664 https://hdl.handle.net/10356/148664 en application/pdf Nanyang Technological University
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Computer science and engineering
spellingShingle	Engineering::Computer science and engineering Goh, Ying Ting Phishing email detection using machine learning
description	Phishing is a form of Internet fraud that uses socially engineered email messages to deceive users into disclosing sensitive information or clicking on malicious links. This can lead to identity theft, data breaches and financial losses for the victims. It is therefore important for individuals and organisations, particularly in the government sector, to safeguard against such threats in a timely manner. This project uses transformers, which have achieved state-of-the-art performance in various natural language processing tasks, to automate the phishing email classification process. Email textual data, email headers and Uniform Resource Locators (URLs) found in emails were extracted from an internal dataset. DistilBERT, DistilRoBERTa and XLNet models were then trained on the email textual data, while traditional machine learning models such as decision trees and random forests were trained on features extracted from the email headers and URLs. These models were then ensembled over a logistic regression layer. The DistilBERT, DistilRoBERTa and XLNet models achieved promising results in phishing email classification, mostly achieving Matthews Correlation Coefficient (MCC) scores of 85–87%. When ensembled over a logistic regression layer, these models performed even better, achieving MCC scores of 86–87%. Random forest models were found to perform the best in classifying the header and URL data extracted from the emails. When used to augment the transformer models, the random forest models trained on the URL data performed the best, improving MCC performance by 3–6%. This shows that augmenting transformer models with random forest models trained on URL data is a promising approach to phishing email classification. Overall, transformers can achieve good results when trained on email textual data for phishing email classification, and they perform even better when augmented with URL data, making this a viable approach to automating the phishing email classification process.
author2	Yeo Chai Kiat
author_facet	Yeo Chai Kiat Goh, Ying Ting
format	Final Year Project
author	Goh, Ying Ting
author_sort	Goh, Ying Ting
title	Phishing email detection using machine learning
publisher	Nanyang Technological University
publishDate	2021
url	https://hdl.handle.net/10356/148664
_version_	1701270579818004480
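
Illustrative sketch (not taken from the project): the description mentions ensembling base models over a logistic regression layer and scoring with the Matthews Correlation Coefficient (MCC). The Python below shows one way those two ideas fit together, assuming each base model outputs a phishing probability per email. The toy probabilities, the function names (mcc, confusion, fit_meta, predict_proba) and the training settings are all hypothetical, and a small hand-written logistic regression stands in for whatever implementation the project actually used.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 = phishing."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_meta(X, y, lr=0.5, epochs=5000):
    """Fit a logistic regression meta-learner by batch gradient descent.

    Each row of X holds the base models' phishing probabilities for one email.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, X):
    """Ensemble probability that each email is phishing."""
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

# Hypothetical base-model outputs: [text_model_prob, url_model_prob] per email.
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.95], [0.6, 0.9],
     [0.1, 0.2], [0.2, 0.1], [0.3, 0.05], [0.4, 0.1]]
y_true = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = fit_meta(X, y_true)
y_pred = [1 if p >= 0.5 else 0 for p in predict_proba(w, b, X)]
score = mcc(*confusion(y_true, y_pred))
print(f"MCC = {score:.2f} ({score * 100:.0f}%)")
```

The sketch fits and scores on the same toy rows purely for brevity. In practice the meta-learner would normally be fitted on held-out base-model predictions and scored on a separate test set. MCC lies between -1 and 1, and the record reports it as a percentage.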

Similar Items