Automatic topic detection of documents

Bibliographic Details
Main Author: Chia, Darren Kok Seng
Other Authors: Mao Kezhi
Format: Final Year Project
Language: English
Published: 2019
Online Access: http://hdl.handle.net/10356/78439
Institution: Nanyang Technological University
Description
Summary: The increasing volume of documents uploaded to the internet every day makes it challenging for users to find relevant articles on specific topics. This motivates the development of a model for “Automatic Topic Detection of documents” using natural language processing. This report gives a brief literature review of representation learning, covering text representation learning and data pre-processing techniques, and of pattern recognition, covering three supervised techniques for pattern classification: Support Vector Machine (SVM), Multinomial Naïve Bayes (MNB) and K Nearest Neighbour (KNN). The research team built an automatic topic detection model using scikit-learn, the Python machine learning library. The experiment used the “20 newsgroups” dataset, which consists of nearly 20,000 documents divided into 20 topics. Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) representations were used together with some pre-processing techniques and the three classifiers named above. Preliminary results for each classifier, and a comparison between them, show that SVM is the best classifier among the three. The team also analyzed how parameter tuning and data pre-processing affect accuracy.
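
The workflow summarized above (a text representation followed by a supervised classifier) might be sketched in scikit-learn roughly as follows. This is an illustrative sketch only, not the report's actual code: the tiny two-topic corpus, the function names `build_models` and `evaluate`, the stop-word setting and the choice of `n_neighbors=3` are all assumptions made for the example. The report's experiment used the “20 newsgroups” dataset, which scikit-learn exposes through `fetch_20newsgroups`, and that could be substituted for the toy data here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for "20 newsgroups" (assumption, not report data).
TRAIN_DOCS = [
    "the rocket launch reached orbit around the moon",
    "nasa plans a new space shuttle mission",
    "astronauts repaired the satellite in orbit",
    "the telescope observed a distant galaxy",
    "the goalie stopped every shot in the hockey game",
    "the team won the playoff game in overtime",
    "the pitcher threw a fastball in the baseball game",
    "the coach praised the players after the season",
]
TRAIN_LABELS = ["space"] * 4 + ["sport"] * 4
TEST_DOCS = ["the shuttle reached orbit", "the team won the hockey game"]
TEST_LABELS = ["space", "sport"]


def build_models():
    """Pair each text representation with one of the three classifiers."""
    return {
        # Bag of Words: raw term counts per document.
        "BoW + MNB": make_pipeline(
            CountVectorizer(stop_words="english"), MultinomialNB()
        ),
        # TF-IDF: term counts reweighted by inverse document frequency.
        "TF-IDF + SVM": make_pipeline(
            TfidfVectorizer(stop_words="english"), LinearSVC(random_state=0)
        ),
        "TF-IDF + KNN": make_pipeline(
            TfidfVectorizer(stop_words="english"),
            KNeighborsClassifier(n_neighbors=3),
        ),
    }


def evaluate(models, train_docs, train_labels, test_docs, test_labels):
    """Fit every pipeline on the training split and return test accuracy."""
    scores = {}
    for name, model in models.items():
        model.fit(train_docs, train_labels)
        scores[name] = model.score(test_docs, test_labels)
    return scores


scores = evaluate(build_models(), TRAIN_DOCS, TRAIN_LABELS, TEST_DOCS, TEST_LABELS)
for name, accuracy in scores.items():
    print(f"{name}: accuracy = {accuracy:.2f}")
```

Accuracy on a corpus this small says nothing about how the classifiers rank on real data; the sketch only shows how the representation and classifier stages fit together, and where parameters such as `n_neighbors` or the vectorizer's pre-processing options could be tuned.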