Automatic topic detection of documents

The increasing volume of documents uploaded onto the internet on a daily basis presents a challenge for users to search for relevant articles on specific topics. This is the basis for developing a model for “Automatic Topic Detection of documents” through the use of natural language processing. This...

Bibliographic Details
Main Author: Chia, Darren Kok Seng
Other Authors: Mao Kezhi
Format: Final Year Project
Language: English
Published: 2019
Subjects:
Online Access: http://hdl.handle.net/10356/78439
Institution: Nanyang Technological University
id sg-ntu-dr.10356-78439
record_format dspace
spelling sg-ntu-dr.10356-78439 2023-07-07T17:54:40Z Automatic topic detection of documents Chia, Darren Kok Seng Mao Kezhi School of Electrical and Electronic Engineering DRNTU::Engineering::Electrical and electronic engineering Bachelor of Engineering (Electrical and Electronic Engineering) 2019-06-20T03:12:28Z 2019-06-20T03:12:28Z 2019 Final Year Project (FYP) http://hdl.handle.net/10356/78439 en Nanyang Technological University 53 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Electrical and electronic engineering
description The growing volume of documents uploaded to the internet every day makes it difficult for users to find relevant articles on specific topics. This motivates the development of a model for “Automatic Topic Detection of documents” using natural language processing. This report gives a brief literature review of representation learning, including text representation and data pre-processing techniques, and of pattern recognition, covering three supervised classification techniques: Support Vector Machine (SVM), Multinomial Naïve Bayes (MNB) and K-Nearest Neighbour (KNN). The research team built an automatic topic detection model with the Python machine learning library scikit-learn. The experiments used the “20 newsgroups” dataset, which consists of nearly 20,000 documents divided into 20 topics. Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) representations, together with several pre-processing techniques, were combined with the three classifiers named above. Preliminary results for each classifier, and a comparison among them, show that SVM performs best of the three. The team also analyzed how parameter tuning and data pre-processing affect classification accuracy.
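
The pipeline the abstract describes (TF-IDF features fed to SVM, MNB and KNN classifiers in scikit-learn) can be sketched roughly as follows. This is a minimal illustration, not the report's actual code: the toy corpus, the `evaluate` helper, the choice of `LinearSVC` for the SVM, and all parameter values are assumptions made here for brevity. The real experiment used the 20 newsgroups dataset, which scikit-learn can download via `sklearn.datasets.fetch_20newsgroups`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the 20 newsgroups data,
# so the sketch runs without a network download.
train_docs = [
    "the rocket launch reached orbit around the moon",
    "nasa satellite images of the planet mars",
    "astronauts repaired the space station telescope",
    "the shuttle crew studied orbit and gravity",
    "the goalie stopped the puck in the hockey game",
    "the team scored three goals in the final period",
    "hockey players skate fast on the ice rink",
    "the coach praised the defence after the game",
]
train_labels = ["space"] * 4 + ["hockey"] * 4

test_docs = [
    "a new satellite will orbit mars",
    "the hockey team won the game on home ice",
]
test_labels = ["space", "hockey"]

# The three classifier families named in the abstract.
# Parameter values are illustrative assumptions, not the report's settings.
classifiers = {
    "SVM": LinearSVC(),
    "MNB": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}


def evaluate(classifiers, train_docs, train_labels, test_docs, test_labels):
    """Fit a TF-IDF + classifier pipeline per model and return test accuracy."""
    scores = {}
    for name, clf in classifiers.items():
        # Pre-processing here is limited to lowercasing and English
        # stop-word removal; the report may have used other steps.
        model = make_pipeline(
            TfidfVectorizer(lowercase=True, stop_words="english"),
            clf,
        )
        model.fit(train_docs, train_labels)
        scores[name] = model.score(test_docs, test_labels)
    return scores


scores = evaluate(classifiers, train_docs, train_labels, test_docs, test_labels)
for name, accuracy in scores.items():
    print(f"{name}: accuracy = {accuracy:.2f}")
```

Swapping `TfidfVectorizer` for `CountVectorizer` gives the plain Bag of Words representation, which makes it straightforward to compare the two text representations under the same classifiers.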
author2 Mao Kezhi
author_facet Mao Kezhi
Chia, Darren Kok Seng
format Final Year Project
author Chia, Darren Kok Seng
author_sort Chia, Darren Kok Seng
title Automatic topic detection of documents
publishDate 2019
url http://hdl.handle.net/10356/78439
_version_ 1772827706529939456