Machine learning for email filtering and categorising

As digitalisation continues, email has become the primary communication channel for personal and business users. This project focuses on three Natural Language Processing (NLP) tasks: 1) spam filtering, 2) categorising, and 3) summarising. For each task, it uses the Enron spam dataset, AG news da...

Bibliographic Details
Main Author: Tan, Kai Qin
Format: Final Year Project / Dissertation / Thesis
Published: 2023
Subjects: HG Finance
Online Access: http://eprints.utar.edu.my/6154/1/TAN_KAI_QIN%2D1906282.pdf
http://eprints.utar.edu.my/6154/
Institution: Universiti Tunku Abdul Rahman
id my-utar-eprints.6154
record_format eprints
spelling my-utar-eprints.6154 2023-12-12T08:24:18Z Machine learning for email filtering and categorising Tan, Kai Qin HG Finance 2023 Final Year Project / Dissertation / Thesis NonPeerReviewed application/pdf http://eprints.utar.edu.my/6154/1/TAN_KAI_QIN%2D1906282.pdf Tan, Kai Qin (2023) Machine learning for email filtering and categorising. Final Year Project, UTAR. http://eprints.utar.edu.my/6154/
institution Universiti Tunku Abdul Rahman
building UTAR Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Tunku Abdul Rahman
content_source UTAR Institutional Repository
url_provider http://eprints.utar.edu.my
topic HG Finance
spellingShingle HG Finance
Tan, Kai Qin
Machine learning for email filtering and categorising
description As digitalisation continues, email has become the primary communication channel for personal and business users. This project focuses on three Natural Language Processing (NLP) tasks: 1) spam filtering, 2) categorising, and 3) summarising, using the Enron spam dataset, the AG News dataset, and the XSum dataset, respectively. Owing to the unprecedented growth in email transactions, businesses generally need an automated email management system for their mailboxes, with applications in customer service and internal email. The project covers classical machine learning methods, conventional neural networks, and transformers. For each task, the models are compared and the one with the highest accuracy and F1 score is selected. The best-performing models are Long Short-Term Memory (LSTM), Bi-directional LSTM (Bi-LSTM), and PEGASUS for spam filtering, categorising, and summarising, respectively. LSTM and Bi-LSTM achieved the highest accuracy on the filtering and categorising tasks, at 99% and 92%, respectively. Similarly, the PEGASUS transformer achieved summary similarity scores about 15% higher in all categories than the conventional neural network. The comparison concludes that limitations on training and machine specification affect the transformer’s performance in categorisation. Under these limitations, conventional neural networks have the upper hand in text categorisation, but transformers showed better resilience in summarisation owing to their unique training method. Interestingly, neither the neural network nor the transformer could reliably distinguish between similar categories, resulting in slightly lower accuracy. Furthermore, the project presents a web-based interface for the three tasks to demonstrate the feasibility of the selected model in each task.
format Final Year Project / Dissertation / Thesis
author Tan, Kai Qin
author_facet Tan, Kai Qin
author_sort Tan, Kai Qin
title Machine learning for email filtering and categorising
title_short Machine learning for email filtering and categorising
title_full Machine learning for email filtering and categorising
title_fullStr Machine learning for email filtering and categorising
title_full_unstemmed Machine learning for email filtering and categorising
title_sort machine learning for email filtering and categorising
publishDate 2023
url http://eprints.utar.edu.my/6154/1/TAN_KAI_QIN%2D1906282.pdf
http://eprints.utar.edu.my/6154/
_version_ 1787140958558617600