Machine learning for email filtering and categorising
Main Author: | |
---|---|
Format: | Final Year Project / Dissertation / Thesis |
Published: | 2023 |
Subjects: | |
Online Access: | http://eprints.utar.edu.my/6154/1/TAN_KAI_QIN%2D1906282.pdf http://eprints.utar.edu.my/6154/ |
Institution: | Universiti Tunku Abdul Rahman |
Summary: | As digitalisation continues, email has become the primary communication channel for both personal and business users. This project focuses on three Natural Language Processing (NLP) tasks: 1) spam filtering, 2) categorising, and 3) summarising, using the Enron spam dataset, the AG News dataset, and the XSum dataset, respectively. Owing to the unprecedented growth in email traffic, businesses generally require an automated email management system for their mailboxes, with applications in customer service and internal email. The project evaluates classical machine learning methods, conventional neural networks, and transformers on each task; the candidate models are compared, and the one with the highest accuracy and F1 score is selected per task. The best-performing models are Long Short-Term Memory (LSTM), Bi-directional LSTM (Bi-LSTM), and PEGASUS for spam filtering, categorising, and summarising, respectively. LSTM and Bi-LSTM achieved the highest accuracy on the filtering and categorising tasks, at 99% and 92%, respectively, while the PEGASUS transformer raised the summary similarity score by about 15% over the conventional neural network across all categories. The comparison concludes that limits on training and machine specification hold back the transformer's performance in categorisation: under these constraints, conventional neural networks have the upper hand in text categorisation, whereas transformers show better resilience in summarisation owing to their training approach. Notably, neither the neural network nor the transformer could fully distinguish similar categories, resulting in slightly lower accuracy. The project also presents a web-based interface for the three tasks to demonstrate the feasibility of the selected model for each task. |
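To make the recurrent-network side of the abstract concrete, below is a minimal sketch (not the thesis code) of a Bi-directional LSTM text classifier of the kind reported for the categorising task, written with Keras. The vocabulary size, sequence length, and layer widths are assumptions for illustration; swapping the bidirectional wrapper for a plain LSTM gives the spam-filtering variant.

```python
# Hypothetical sketch, not the author's implementation: a Bi-LSTM classifier
# for four-class news categorisation (AG News has four categories).
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20_000      # assumed vocabulary size
SEQUENCE_LENGTH = 200    # assumed maximum tokens per document
NUM_CLASSES = 4          # AG News categories

model = tf.keras.Sequential([
    layers.Input(shape=(SEQUENCE_LENGTH,), dtype="int32"),
    layers.Embedding(VOCAB_SIZE, 128, mask_zero=True),
    # Replace with layers.LSTM(64) for a unidirectional spam-filtering model.
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```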
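For the summarisation task, the abstract names PEGASUS as the best-performing model. The following is a small illustrative sketch, assuming the publicly available `google/pegasus-xsum` checkpoint and the Hugging Face `transformers` library; the checkpoint name and sample email text are assumptions, not taken from the thesis.

```python
# Hypothetical sketch, not the author's implementation: abstractive
# summarisation of an email body with a PEGASUS checkpoint fine-tuned on XSum.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-xsum"  # assumed public checkpoint
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

email_body = (
    "Hi team, the quarterly report is attached. Please review the revenue "
    "figures in section three and send any feedback by Friday so we can "
    "finalise the slides for next week's board meeting."
)

# Tokenise, generate with beam search, and decode a short abstractive summary.
inputs = tokenizer(email_body, truncation=True, padding="longest",
                   return_tensors="pt")
summary_ids = model.generate(**inputs, num_beams=4, max_length=64)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```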