Automatic topic detection of news

This project explores Natural Language Processing and how to apply it to automatic topic detection, namely the categorization and topic generation of news articles. It focuses mainly on unsupervised learning methods to reduce the...

Bibliographic Details
Main Author: Liu, Fengyuan
Other Authors: Mao Kezhi
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2020
Subjects: Engineering::Electrical and electronic engineering
Online Access: https://hdl.handle.net/10356/140587
Institution: Nanyang Technological University
id sg-ntu-dr.10356-140587
record_format dspace
spelling sg-ntu-dr.10356-140587 2023-07-07T18:47:33Z Automatic topic detection of news Liu, Fengyuan Mao Kezhi School of Electrical and Electronic Engineering ekzmao@ntu.edu.sg Engineering::Electrical and electronic engineering
Bachelor of Engineering (Information Engineering and Media) 2020-05-31T11:47:56Z 2020-05-31T11:47:56Z 2020 Final Year Project (FYP) https://hdl.handle.net/10356/140587 en A1119-191 application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Electrical and electronic engineering
spellingShingle Engineering::Electrical and electronic engineering
Liu, Fengyuan
Automatic topic detection of news
description This project explores Natural Language Processing and how to apply it to automatic topic detection, namely the categorization and topic generation of news articles. It focuses mainly on unsupervised learning methods, which reduce the amount of manual work and fulfill the “automatic” component of the project [1]. Choosing the “right” information to read on the internet is a growing problem, especially for news, given the vast amount available online. One current solution is to filter or categorize news into sections and topics. However, manual categorization is slow and error-prone because it depends on personal opinion. The project therefore explores news topic detection using machine learning. The first half of the project explores topic modeling [2] and how to categorize news text using machine learning. The method chosen is Latent Dirichlet Allocation [3]. The model is trained on the “20 Newsgroups” dataset, which contains 20,000 news documents across 20 different fields [4]. The second half of the project takes the categorized results and refines the categories further by generating new topic titles to choose from. The method used is Word2vec, pre-trained on the “Text8” corpus and fine-tuned on the “20 Newsgroups” dataset. The project also experiments with different approaches and hyperparameters to further analyze the results of both techniques.
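The categorization step named in the description is Latent Dirichlet Allocation. As a rough illustration only, the Python sketch below fits LDA by collapsed Gibbs sampling on a tiny made-up corpus. The corpus, the function names (`lda_gibbs`, `top_words`), the choice of a Gibbs sampler, and the hyperparameter values are all assumptions made for this example; the record does not state which implementation or settings the project used.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling (illustrative sketch).

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab),
    where doc_topic[d][k] counts tokens of document d assigned to topic k
    and topic_word[k][v] counts assignments of vocabulary word v to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, k, n=3):
    """Return the n words most often assigned to topic k."""
    ranked = sorted(range(len(vocab)), key=lambda v: topic_word[k][v], reverse=True)
    return [vocab[v] for v in ranked[:n]]


# Hypothetical toy corpus (not the 20 Newsgroups data).
docs = [
    "team wins match goal player".split(),
    "player scores goal team coach".split(),
    "market stocks bank profit shares".split(),
    "bank profit market investors stocks".split(),
]

doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for k in range(2):
    print("topic", k, top_words(topic_word, vocab, k))
```

Each document can then be assigned to its most frequent topic, for example with `max(range(2), key=lambda k: doc_topic[d][k])`. On a real corpus, an established library implementation would normally be preferred over a hand-written sampler.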
author2 Mao Kezhi
author_facet Mao Kezhi
Liu, Fengyuan
format Final Year Project
author Liu, Fengyuan
author_sort Liu, Fengyuan
title Automatic topic detection of news
title_sort automatic topic detection of news
publisher Nanyang Technological University
publishDate 2020
url https://hdl.handle.net/10356/140587
_version_ 1772825184436224000