Machine learning for email filtering and categorising

Bibliographic Details
Main Author: Tan, Kai Qin
Format: Final Year Project / Dissertation / Thesis
Published: 2023
Online Access: http://eprints.utar.edu.my/6154/1/TAN_KAI_QIN%2D1906282.pdf
http://eprints.utar.edu.my/6154/
Institution: Universiti Tunku Abdul Rahman
Description
Summary: As digitalisation continues, email has become the primary communication channel for personal and business users. This project focuses on three Natural Language Processing (NLP) tasks: 1) spam filtering, 2) categorising, and 3) summarising. These tasks use the Enron spam dataset, the AG News dataset, and the XSum dataset, respectively. Owing to the unprecedented growth in email volume, businesses generally require an automated email management system for their mailboxes, with applications in customer service and internal email. The project applies classical machine learning methods, conventional neural networks, and transformers to the tasks. For each task, the models are compared and the one with the highest accuracy and F1 score is selected. The best-performing models are Long Short-Term Memory (LSTM), Bi-directional LSTM (Bi-LSTM), and PEGASUS for spam filtering, categorising, and summarising, respectively. LSTM and Bi-LSTM achieved the highest accuracy on the filtering and categorising tasks, at 99% and 92%, respectively. Similarly, the PEGASUS transformer achieved summary similarity scores about 15% higher in all categories than the conventional neural network. The comparison concludes that limitations on training and machine specifications affect the transformers' performance in categorisation. Under these limitations, conventional neural networks have the upper hand in text categorisation, but transformers showed better resilience in summarisation owing to their unique training method. Interestingly, neither the neural network nor the transformer could fully separate categories that resemble one another, which resulted in slightly lower accuracy. Furthermore, the project presents a web-based interface for the three tasks to demonstrate the feasibility of the selected model for each task.
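
The selection step mentioned in the summary (compare candidate models, keep the one with the highest accuracy and F1 score) could be sketched roughly as follows. This is only an illustration, not the thesis's code: the model names and toy predictions are hypothetical, the F1 shown is the binary (spam vs. not spam) form, and ranking by accuracy first with F1 as the tie-breaker is an assumption, since the record does not say how the two metrics are combined.

```python
from typing import Dict, List, Tuple


def accuracy_and_f1(y_true: List[int], y_pred: List[int],
                    positive: int = 1) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels, treating `positive` as the spam class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
    return accuracy, f1


def select_best(y_true: List[int], predictions: Dict[str, List[int]]) -> str:
    """Pick the model with the highest (accuracy, F1) pair.

    Assumption: accuracy is compared first and F1 breaks ties.
    """
    scores = {name: accuracy_and_f1(y_true, pred)
              for name, pred in predictions.items()}
    return max(scores, key=lambda name: scores[name])


# Hypothetical toy labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 1],
    "model_b": [1, 0, 1, 1, 0, 0, 1, 1],
}
best = select_best(y_true, predictions)
```

For the four-class categorising task, a per-class F1 averaged across classes would be the natural extension; which averaging the thesis used is not stated in this record.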