Automatic topic detection of documents

The increasing volume of documents uploaded onto the internet on a daily basis presents a challenge for users to search for relevant articles on specific topics. This is the basis for developing a model for “Automatic Topic Detection of documents” through the use of natural language processing. This...

Bibliographic Details
Main Author:	Chia, Darren Kok Seng
Other Authors:	Mao Kezhi
Format:	Final Year Project
Language:	English
Published:	2019
Subjects:	DRNTU::Engineering::Electrical and electronic engineering
Online Access:	http://hdl.handle.net/10356/78439
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-78439
record_format	dspace
spelling	sg-ntu-dr.10356-78439 2023-07-07T17:54:40Z Automatic topic detection of documents Chia, Darren Kok Seng Mao Kezhi School of Electrical and Electronic Engineering DRNTU::Engineering::Electrical and electronic engineering The increasing volume of documents uploaded onto the internet on a daily basis presents a challenge for users to search for relevant articles on specific topics. This is the basis for developing a model for “Automatic Topic Detection of documents” through the use of natural language processing. This report gives a brief literature review of representation learning, including text representation learning and data pre-processing techniques, and of pattern recognition, covering three supervised techniques for pattern classification: Support Vector Machine (SVM), Multinomial Naïve Bayes (MNB) and K-Nearest Neighbour (KNN). The research team built an automatic topic detection model using the Python machine learning library scikit-learn. The “20 newsgroups” dataset, which consists of nearly 20,000 documents divided into 20 topics, was used for the experiment. Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) representations, along with some pre-processing techniques, were used with the three classifiers named above. Preliminary results for each classifier and a comparison between the reports show that SVM is the best classifier among the three; the team also analyzed how parameter tuning and data pre-processing affect accuracy. Bachelor of Engineering (Electrical and Electronic Engineering) 2019-06-20T03:12:28Z 2019-06-20T03:12:28Z 2019 Final Year Project (FYP) http://hdl.handle.net/10356/78439 en Nanyang Technological University 53 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Electrical and electronic engineering
spellingShingle	DRNTU::Engineering::Electrical and electronic engineering Chia, Darren Kok Seng Automatic topic detection of documents
description	The increasing volume of documents uploaded onto the internet on a daily basis presents a challenge for users to search for relevant articles on specific topics. This is the basis for developing a model for “Automatic Topic Detection of documents” through the use of natural language processing. This report gives a brief literature review of representation learning, including text representation learning and data pre-processing techniques, and of pattern recognition, covering three supervised techniques for pattern classification: Support Vector Machine (SVM), Multinomial Naïve Bayes (MNB) and K-Nearest Neighbour (KNN). The research team built an automatic topic detection model using the Python machine learning library scikit-learn. The “20 newsgroups” dataset, which consists of nearly 20,000 documents divided into 20 topics, was used for the experiment. Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) representations, along with some pre-processing techniques, were used with the three classifiers named above. Preliminary results for each classifier and a comparison between the reports show that SVM is the best classifier among the three; the team also analyzed how parameter tuning and data pre-processing affect accuracy.
author2	Mao Kezhi
author_facet	Mao Kezhi Chia, Darren Kok Seng
format	Final Year Project
author	Chia, Darren Kok Seng
author_sort	Chia, Darren Kok Seng
title	Automatic topic detection of documents
title_sort	automatic topic detection of documents
publishDate	2019
url	http://hdl.handle.net/10356/78439
_version_	1772827706529939456
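
The description in this record names TF-IDF text representation and a K-Nearest Neighbour classifier among the techniques studied. As a rough illustration of how those two pieces fit together, here is a minimal sketch in plain Python. It is not the project's code and does not use scikit-learn. The six training sentences are an invented toy corpus standing in for the 20 newsgroups dataset, and every function name (`tokenize`, `fit_idf`, `tfidf`, `cosine`, `knn_predict`) is hypothetical. The smoothed IDF formula and cosine-similarity voting are assumptions chosen for simplicity.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, keeping alphabetic tokens only."""
    return [w for w in text.lower().split() if w.isalpha()]


def fit_idf(token_lists):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens, idf):
    """L2-normalised sparse TF-IDF vector; terms unseen in training are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def knn_predict(query, train_vecs, train_labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(query, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy corpus (assumption): two topics, three documents each.
train_texts = [
    "the rocket launch reached orbit",
    "nasa plans a new shuttle mission to orbit",
    "the satellite entered orbit around mars",
    "the goalie stopped every shot in the hockey game",
    "the team won the hockey playoff game",
    "a late goal decided the hockey season",
]
train_labels = ["space", "space", "space", "hockey", "hockey", "hockey"]

train_tokens = [tokenize(t) for t in train_texts]
idf = fit_idf(train_tokens)
train_vecs = [tfidf(tokens, idf) for tokens in train_tokens]


def predict(text, k=3):
    """Classify a new document by topic."""
    return knn_predict(tfidf(tokenize(text), idf), train_vecs, train_labels, k)


print(predict("the shuttle reached orbit"))
print(predict("hockey game tonight"))
```

On a real corpus, the same steps would normally be delegated to a library and checked against held-out test documents. The toy version only makes the data flow visible: tokens, then weighted vectors, then a similarity-based vote.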

Similar Items