Automatic topic detection of documents
The increasing volume of documents uploaded to the internet every day makes it challenging for users to find relevant articles on specific topics. This motivates the development of a model for “Automatic Topic Detection of documents” using natural language processing. This...
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | 2019 |
Subjects: | |
Online Access: | http://hdl.handle.net/10356/78439 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-78439 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-78439 2023-07-07T17:54:40Z Automatic topic detection of documents Chia, Darren Kok Seng Mao Kezhi School of Electrical and Electronic Engineering DRNTU::Engineering::Electrical and electronic engineering The increasing volume of documents uploaded to the internet every day makes it challenging for users to find relevant articles on specific topics. This motivates the development of a model for “Automatic Topic Detection of documents” using natural language processing. This report gives a brief literature review of representation learning, covering text representation and data pre-processing techniques, and of pattern recognition, covering three supervised classification techniques: Support Vector Machine (SVM), Multinomial Naïve Bayes (MNB) and K-Nearest Neighbour (KNN). The research team built an automatic topic detection model in Python using the scikit-learn machine learning library. The experiments used the “20 newsgroups” dataset, which consists of nearly 20,000 documents divided into 20 topics. Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) representations, along with several pre-processing techniques, were used with the three classifiers named above. Preliminary results for each classifier, and a comparison of those results, show that SVM is the best classifier of the three. The team also analyzed how parameter tuning and data pre-processing affect accuracy. Bachelor of Engineering (Electrical and Electronic Engineering) 2019-06-20T03:12:28Z 2019-06-20T03:12:28Z 2019 Final Year Project (FYP) http://hdl.handle.net/10356/78439 en Nanyang Technological University 53 p. application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
DRNTU::Engineering::Electrical and electronic engineering |
spellingShingle |
DRNTU::Engineering::Electrical and electronic engineering Chia, Darren Kok Seng Automatic topic detection of documents |
description |
The increasing volume of documents uploaded to the internet every day makes it challenging for users to find relevant articles on specific topics. This motivates the development of a model for “Automatic Topic Detection of documents” using natural language processing. This report gives a brief literature review of representation learning, covering text representation and data pre-processing techniques, and of pattern recognition, covering three supervised classification techniques: Support Vector Machine (SVM), Multinomial Naïve Bayes (MNB) and K-Nearest Neighbour (KNN). The research team built an automatic topic detection model in Python using the scikit-learn machine learning library. The experiments used the “20 newsgroups” dataset, which consists of nearly 20,000 documents divided into 20 topics. Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) representations, along with several pre-processing techniques, were used with the three classifiers named above. Preliminary results for each classifier, and a comparison of those results, show that SVM is the best classifier of the three. The team also analyzed how parameter tuning and data pre-processing affect accuracy. |
author2 |
Mao Kezhi |
author_facet |
Mao Kezhi Chia, Darren Kok Seng |
format |
Final Year Project |
author |
Chia, Darren Kok Seng |
author_sort |
Chia, Darren Kok Seng |
title |
Automatic topic detection of documents |
title_short |
Automatic topic detection of documents |
title_full |
Automatic topic detection of documents |
title_fullStr |
Automatic topic detection of documents |
title_full_unstemmed |
Automatic topic detection of documents |
title_sort |
automatic topic detection of documents |
publishDate |
2019 |
url |
http://hdl.handle.net/10356/78439 |
_version_ |
1772827706529939456 |
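
The description above names Bag of Words, TF-IDF and Multinomial Naïve Bayes as parts of the topic detection pipeline. As a rough illustration of how those pieces fit together, here is a minimal standard-library sketch. It is not the author's scikit-learn code, which this record does not include. The function and class names, the smoothed idf formula, the add-one smoothing and the four toy sentences are all assumptions made for the example; only the two label names are borrowed from the 20 newsgroups topic list.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Minimal pre-processing: lowercase, keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Bag of Words: map each term to its raw count in the text."""
    return Counter(tokenize(text))


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: tf * (ln((1 + N) / (1 + df)) + 1), a smoothed idf.
    """
    bows = [bag_of_words(d) for d in docs]
    df = Counter(term for bow in bows for term in bow)  # document frequency
    n = len(docs)
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in bow.items()}
        for bow in bows
    ]


class TinyMultinomialNB:
    """Hypothetical multinomial Naive Bayes over raw term counts."""

    def fit(self, docs, labels):
        self.doc_counts = Counter(labels)         # documents per class
        self.term_counts = defaultdict(Counter)   # term counts per class
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(bag_of_words(doc))
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, doc):
        bow = bag_of_words(doc)
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_label in self.doc_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_label / n_docs)    # log prior
            for term, tf in bow.items():
                if term in self.vocab:            # ignore unseen terms
                    # add-one (Laplace) smoothed log likelihood
                    score += tf * math.log((counts[term] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up toy corpus, standing in for the real dataset.
train_docs = [
    "the rocket launch reached orbit",
    "nasa plans a new moon mission",
    "the goalie stopped every shot in the hockey game",
    "the team won the hockey playoff game",
]
train_labels = ["sci.space", "sci.space", "rec.sport.hockey", "rec.sport.hockey"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict("a new rocket mission to orbit the moon"))
print(model.predict("hockey game tonight"))
```

In scikit-learn, the roughly corresponding building blocks would be `CountVectorizer`, `TfidfVectorizer` and `MultinomialNB`, though the record does not state which classes or parameters the project actually used.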