Deep learning-based automatic document categorization and organization

Bibliographic Details
Main Author: Foo, Shawn Nicholas Say Yan
Other Authors: Mao Kezhi
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Online Access: https://hdl.handle.net/10356/149304
Institution: Nanyang Technological University
Description
Summary: With the rapid advance of information technology, document classification has become a major research area in Natural Language Processing. Previously, online documents were categorized using traditional Machine Learning algorithms. However, these algorithms have been shown to be unable to cope with the massive amount of online information generated daily. The performance of Deep Learning algorithms, on the other hand, improves as more data becomes available. Therefore, we introduce Deep Learning models to perform the document classification task, making use of the large amount of data generated daily. This project aims to build an AI system that performs document classification using Deep Learning-based methods. In my work, five Deep Learning-based models are compared and evaluated. In the coarse-grained classification task, the models classify news articles into five entry-level categories: Economy, Fuel Price, Illegal Fishing, Weather and Climate, and Others. A fine-grained classification task was also conducted, in which news articles in the Fuel Price category are further classified into two subcategories: Price Increase and Price Decrease. The model that combines TF-IDF word representation with a Feedforward Artificial Neural Network outperformed all the other models, with classification accuracies of 98% and 88.25% on the coarse-grained and fine-grained classification tasks, respectively. News classification allows us to detect the occurrence of certain events. In particular, the news classification done in this project contributes to detecting piracy in the Straits of Malacca. The project has successfully identified the Deep Learning-based model best suited to classifying news articles, and this model can be used to analyze trends in piracy in the Straits of Malacca.
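The summary names TF-IDF word representation feeding a Feedforward Artificial Neural Network as the best-performing combination but gives no implementation details. The following is only a minimal sketch of that general idea, not the project's actual model. It uses the Python standard library alone. The toy headlines, the single hidden layer of 8 tanh units, the learning rate, the epoch count, and all function and class names are assumptions made for illustration. The category names are taken from the summary.

```python
import math
import random
from collections import Counter

# Hypothetical toy headlines, invented for illustration (not project data).
TRAIN = [
    ("fuel price increase hits petrol stations", "Fuel Price"),
    ("petrol and diesel fuel price drop", "Fuel Price"),
    ("heavy rain and flood weather warning", "Weather and Climate"),
    ("storm brings rain as weather worsens", "Weather and Climate"),
    ("economy shows growth in trade and export", "Economy"),
    ("trade slowdown weighs on economy growth", "Economy"),
]
LABELS = sorted({label for _, label in TRAIN})

def fit_tfidf(docs):
    """Learn the vocabulary and smoothed IDF weights from tokenized docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vocab = sorted(df)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    return vocab, idf

def tfidf_vector(doc, vocab, idf):
    """Map a tokenized doc to an L2-normalized TF-IDF vector."""
    tf = Counter(doc)  # words outside the vocabulary are ignored below
    vec = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class FeedforwardNet:
    """One tanh hidden layer and a softmax output (assumed architecture)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = [sum(w * hi for w, hi in zip(row, h)) + b
             for row, b in zip(self.w2, self.b2)]
        m = max(z)  # shift logits for a numerically stable softmax
        e = [math.exp(v - m) for v in z]
        s = sum(e)
        return h, [v / s for v in e]

    def train_step(self, x, target, lr=0.5):
        """One SGD step on cross-entropy loss; returns the pre-update loss."""
        h, p = self.forward(x)
        dz = [pi - (1.0 if k == target else 0.0) for k, pi in enumerate(p)]
        # Backpropagate to the hidden layer before touching w2.
        dh = [sum(dz[k] * self.w2[k][j] for k in range(len(dz)))
              * (1.0 - h[j] ** 2) for j in range(len(h))]
        for k, g in enumerate(dz):
            for j in range(len(h)):
                self.w2[k][j] -= lr * g * h[j]
            self.b2[k] -= lr * g
        for j, g in enumerate(dh):
            for i in range(len(x)):
                self.w1[j][i] -= lr * g * x[i]
            self.b1[j] -= lr * g
        return -math.log(p[target])

    def predict(self, x):
        _, p = self.forward(x)
        return max(range(len(p)), key=p.__getitem__)

docs = [text.lower().split() for text, _ in TRAIN]
vocab, idf = fit_tfidf(docs)
X = [tfidf_vector(d, vocab, idf) for d in docs]
y = [LABELS.index(label) for _, label in TRAIN]

net = FeedforwardNet(n_in=len(vocab), n_hidden=8, n_out=len(LABELS))
first_loss = sum(net.train_step(x, t) for x, t in zip(X, y))
for _ in range(300):
    last_loss = sum(net.train_step(x, t) for x, t in zip(X, y))

def classify(text):
    """Return the predicted category name for a raw headline."""
    vec = tfidf_vector(text.lower().split(), vocab, idf)
    return LABELS[net.predict(vec)]
```

Calling `classify("fuel price rises again")` returns one of the three category names. With six training headlines the result says nothing about real-world accuracy. A fine-grained step such as Price Increase versus Price Decrease could follow the same pattern, with a second network trained only on articles already assigned to Fuel Price.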