Phishing email detection using machine learning

Bibliographic details
Main author:	Goh, Ying Ting
Other authors:	Yeo Chai Kiat
Format:	Final Year Project
Language:	English
Published:	Nanyang Technological University 2021
Subjects:	Engineering::Computer science and engineering
Online access:	https://hdl.handle.net/10356/148664
Institution:	Nanyang Technological University

Description
Abstract:	Phishing is a form of Internet fraud that uses socially engineered email messages to deceive users into disclosing sensitive information or clicking on malicious links. It can lead to identity theft, data breaches and financial losses for victims. It is therefore important for individuals and organisations, particularly in the government sector, to safeguard against such threats in a timely manner. This project uses transformers, which have achieved state-of-the-art performance in various natural language processing tasks, to automate phishing email classification. Email text, email headers and the Uniform Resource Locators (URLs) found in emails were extracted from an internal dataset. DistilBERT, DistilRoBERTa and XLNet models were trained on the email text, while traditional machine learning models such as decision trees and random forests were trained on features extracted from the email headers and URLs. These models were then ensembled through a logistic regression layer. The DistilBERT, DistilRoBERTa and XLNet models achieved promising results in phishing email classification, mostly reaching Matthews Correlation Coefficient (MCC) scores of 85 to 87%. When ensembled through a logistic regression layer, these models performed even better, reaching MCC scores of 86 to 87%. Random forest models were found to perform best at classifying the header and URL data extracted from the emails. When combined with the transformer models, the random forest models trained on URL data performed best, improving MCC by 3 to 6%. This shows that augmenting transformer models with random forest models trained on URL data is a promising approach to phishing email classification. Overall, transformers achieve good results when trained on email text, and perform even better when augmented with URL data, making this a viable approach to automating phishing email classification.
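The abstract describes two ideas that a small example may make more concrete: scoring a classifier with MCC, and combining the outputs of several base models through a logistic regression layer. The sketch below is illustrative only and is not the project's code. The base-model scores are invented toy values (not results from the project), the names `base_scores`, `fit_logistic` and `predict` are hypothetical, and the logistic regression is a plain gradient-descent implementation chosen so that the example needs only the Python standard library.

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (1 = phishing, 0 = legitimate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression meta-learner by batch gradient descent."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, rows, threshold=0.5):
    """Turn meta-learner probabilities into 0/1 labels."""
    return [
        1 if sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= threshold else 0
        for x in rows
    ]

# Invented toy data: each row is [text-model score, URL-model score],
# standing in for a transformer and a random forest respectively.
base_scores = [
    [0.90, 0.80], [0.80, 0.70], [0.40, 0.95], [0.70, 0.90],  # phishing
    [0.20, 0.10], [0.30, 0.20], [0.10, 0.30], [0.45, 0.10],  # legitimate
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Baseline: threshold the text-model score alone.
text_only = [1 if row[0] >= 0.5 else 0 for row in base_scores]

# Stacked ensemble: logistic regression over both base-model scores.
weights, bias = fit_logistic(base_scores, labels)
stacked = predict(weights, bias, base_scores)

print("MCC, text model only:", round(mcc(labels, text_only), 3))
print("MCC, stacked ensemble:", round(mcc(labels, stacked), 3))
```

In this toy data the third email gets a low text score but a high URL score, so the text-only baseline misses it while the stacked model can recover it. For brevity the meta-learner here is fitted and evaluated on the same rows. In practice it would normally be fitted on held-out base-model predictions and scored on a separate test set.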
