Phishing email detection using machine learning

Phishing is an Internet fraud that uses socially engineered email messages to deceive users into disclosing sensitive information or clicking on malicious links. This can lead to identity theft, data breaches and financial losses for the victims. As such, it is important for individuals and organisations,...

Bibliographic Details
Main Author: Goh, Ying Ting
Other Authors: Yeo Chai Kiat
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects: Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/148664
Institution: Nanyang Technological University
id sg-ntu-dr.10356-148664
record_format dspace
spelling sg-ntu-dr.10356-148664 2021-05-15T12:38:50Z
Phishing email detection using machine learning
Goh, Ying Ting
Yeo Chai Kiat
School of Computer Science and Engineering
Government Technology Agency Singapore
Gareth Yeo
ASCKYEO@ntu.edu.sg
Engineering::Computer science and engineering
Bachelor of Engineering (Computer Science)
2021 Final Year Project (FYP)
Goh, Y. T. (2021). Phishing email detection using machine learning. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/148664
en
application/pdf
Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
description Phishing is an Internet fraud that uses socially engineered email messages to deceive users into disclosing sensitive information or clicking on malicious links. This can lead to identity theft, data breaches and financial losses for the victims. As such, it is important for individuals and organisations, particularly the government sector, to safeguard against such threats in a timely manner. This project seeks to use transformers, which have achieved state-of-the-art performance in various natural language processing tasks, to automate the phishing email classification process. To do so, email textual data, email headers and Uniform Resource Locators (URLs) found in emails were extracted from an internal dataset. DistilBERT, DistilRoBERTa and XLNet models were then trained on the email textual data, while traditional machine learning models such as decision trees and random forests were trained on features extracted from the email headers and URLs. These models were then ensembled over a logistic regression layer. The DistilBERT, DistilRoBERTa and XLNet models achieved promising results in phishing email classification, mostly achieving Matthews Correlation Coefficient (MCC) scores of 85–87%. When ensembled over a logistic regression layer, these models performed even better, achieving MCC scores of 86–87%. Random forest models were found to perform best in classifying the header and URL data extracted from the emails. When combined with the transformer models, the random forest models trained on the URL data performed best, improving MCC by 3–6%. This shows that augmenting transformer models with random forest models trained on URL data is a promising approach to phishing email classification. All in all, transformers can achieve good results when trained on email textual data to perform phishing email classification. When augmented with URL data, these models perform even better, making this a viable approach to automating the phishing email classification process.
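The description above mentions two ideas that a short example may make more concrete: combining the outputs of several base models through a logistic regression layer, and scoring the result with the Matthews Correlation Coefficient (MCC). The following is a minimal sketch in plain Python, not the project's implementation, which this record does not describe. The function names (`mcc`, `fit_stacker`, `predict`), the learning rate, the epoch count and the toy probabilities are all assumptions made up for illustration; in particular, the numbers are not results from the project.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (1 = phishing, 0 = legitimate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stacker(base_probs, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression layer on base-model probabilities.

    base_probs: one row per email, one column per base model (hypothetical inputs).
    Uses plain batch gradient descent on the log loss; lr and epochs are arbitrary.
    """
    n_models = len(base_probs[0])
    n = len(labels)
    weights = [0.0] * n_models
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for x, y in zip(base_probs, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(base_probs, weights, bias, threshold=0.5):
    """Turn stacked probabilities into 0/1 predictions."""
    return [
        1 if sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= threshold else 0
        for x in base_probs
    ]


# Toy data, invented for illustration only: each row holds the phishing
# probability from a text model and from a URL model for one email.
toy_probs = [[0.9, 0.8], [0.8, 0.7], [0.7, 0.9], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
toy_labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_stacker(toy_probs, toy_labels)
print("MCC on toy data:", mcc(toy_labels, predict(toy_probs, weights, bias)))
```

In practice, library implementations such as scikit-learn's `LogisticRegression` and `matthews_corrcoef` would usually replace the hand-written functions, and the stacking layer would be fitted on held-out predictions rather than on the data the base models were trained on.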
author2 Yeo Chai Kiat
format Final Year Project
author Goh, Ying Ting
title Phishing email detection using machine learning
publisher Nanyang Technological University
publishDate 2021
url https://hdl.handle.net/10356/148664