Phishing email detection using machine learning
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2021 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/148664 |
Institution: | Nanyang Technological University |
Summary: | Phishing is a form of Internet fraud that uses socially engineered email messages to deceive users into disclosing sensitive information or clicking on malicious links. This can lead to identity theft, data breaches and financial losses for the victims. It is therefore important for individuals and organisations, particularly in the government sector, to safeguard against such threats in a timely manner. This project uses transformers, which have achieved state-of-the-art performance in various natural language processing tasks, to automate phishing email classification. Email textual data, email headers and Uniform Resource Locators (URLs) found in emails were extracted from an internal dataset. DistilBERT, DistilRoBERTa and XLNet models were then trained on the email textual data, while traditional machine learning models such as decision trees and random forests were trained on features extracted from the email headers and URLs. These models were then ensembled over a logistic regression layer. The DistilBERT, DistilRoBERTa and XLNet models achieved promising results in phishing email classification, mostly reaching Matthews Correlation Coefficient (MCC) scores of 85–87%. When ensembled over a logistic regression layer, these models performed even better, achieving MCC scores of 86–87%. Random forest models were found to perform best in classifying the header and URL data extracted from the emails. When used to augment the transformer models, the random forest models trained on URL data performed best, improving MCC by 3–6%. This shows that augmenting transformer models with random forest models trained on URL data is a promising approach to phishing email classification. Overall, transformers trained on email textual data can achieve good results in phishing email classification. Augmenting them with URL data improves performance further, making this a viable approach to automating the phishing email classification process. |
---|
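The summary describes two ideas that a small example may help make concrete: combining base-model outputs through a logistic regression layer, and scoring the result with MCC. The sketch below is only an illustration under stated assumptions, not the project's implementation. The probabilities in `BASE_PROBS` are invented toy values standing in for the phishing scores of a text model and a URL model, and `fit_stacker`, `predict` and `mcc` are hypothetical helper names written in plain Python.

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (1 = phishing)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stacker(X, y, lr=1.0, epochs=5000):
    """Fit a logistic regression meta-learner with full-batch gradient descent.

    Each row of X holds the base models' phishing probabilities for one email.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b):
    """Label an email as phishing (1) when the stacked score is positive."""
    return [1 if b + sum(wj * xj for wj, xj in zip(w, xi)) > 0 else 0 for xi in X]

# Toy data (assumed, not from the project): [text-model prob, URL-model prob].
BASE_PROBS = [
    [0.90, 0.80], [0.95, 0.60], [0.70, 0.90], [0.85, 0.75],  # phishing
    [0.10, 0.20], [0.30, 0.15], [0.05, 0.40], [0.20, 0.10],  # legitimate
]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = fit_stacker(BASE_PROBS, LABELS)
preds = predict(BASE_PROBS, w, b)
print("stacked MCC on toy data:", mcc(LABELS, preds))
```

In practice the meta-learner would be fitted on held-out base-model predictions rather than on the data the base models were trained on, so that the reported MCC is not inflated by leakage.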