Clustering word embeddings with different properties for topic modelling

The goal of topic detection, or topic modelling, is to uncover the hidden topics in a large corpus, and it has become an increasingly useful analysis tool in the information age. As research has progressed, topic modelling methods have gradually expanded from probabilistic models to distributed representations. The emergence of BERT marked a major shift in the NLP paradigm: this deep model, pre-trained on unlabelled datasets, significantly improved accuracy across NLP tasks, and topic modelling was extended to build on pre-trained word embeddings. This success has inspired many further BERT-based models. This project studies topic modelling with machine learning techniques. Pre-trained or fine-tuned embeddings from BERT, SBERT, ERNIE 2.0 or SimCSE are combined with weighted K-means, where document statistics serve as the weighting factor, to group or re-rank top words and identify the top 20 topics in the open-source 20 Newsgroups dataset. The results show that the SBERT and SimCSE_sup models outperform the others. In addition, the properties of the different models' embeddings and their impact on topic identification are discussed.
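The pipeline described above can be illustrated with a short sketch: embed the corpus vocabulary, run weighted K-means over the word vectors with a document statistic as the per-word weight, and read the highest-weighted words of each cluster as a topic. This is a minimal illustration of the general approach, not the author's implementation; the SBERT checkpoint "all-MiniLM-L6-v2", the use of document frequency as the weighting statistic, and the 2,000-word vocabulary are assumptions made for the example.

```python
# Illustrative sketch: cluster word embeddings with weighted K-means to surface topics.
# Assumptions (not taken from the thesis): SBERT model "all-MiniLM-L6-v2", document
# frequency as the weighting statistic, and a 2,000-word vocabulary.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data

# Build a vocabulary and collect a simple document statistic (document frequency) per word.
vectorizer = CountVectorizer(stop_words="english", max_features=2000)
counts = vectorizer.fit_transform(docs)
vocab = vectorizer.get_feature_names_out()
doc_freq = np.asarray((counts > 0).sum(axis=0)).ravel()  # number of documents containing each word

# Embed every vocabulary word with a pre-trained SBERT model.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
word_vecs = embedder.encode(list(vocab), show_progress_bar=False)

# Weighted K-means: document frequency acts as the per-sample weight, so common,
# well-attested words pull the cluster centroids more strongly.
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0)
labels = kmeans.fit_predict(word_vecs, sample_weight=doc_freq)

# Report the highest-weighted words in each cluster as that topic's top words.
for k in range(20):
    members = np.where(labels == k)[0]
    top = members[np.argsort(doc_freq[members])[::-1][:10]]
    print(f"topic {k:2d}:", ", ".join(vocab[i] for i in top))
```

Passing the statistic through K-means's sample_weight is one straightforward way to let well-attested words dominate the centroids; per the abstract, the same statistics can also be used to re-rank the top words within each cluster.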

Bibliographic Details
Main Author: Wu, Yijun
Other Authors: Lihui Chen
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2021
Subjects: Engineering::Electrical and electronic engineering
Online Access: https://hdl.handle.net/10356/152557
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-152557
record_format dspace
spelling sg-ntu-dr.10356-152557 (record last updated 2023-07-04T17:35:49Z)
title Clustering word embeddings with different properties for topic modelling
author Wu, Yijun
supervisor Lihui Chen (ELHCHEN@ntu.edu.sg), School of Electrical and Electronic Engineering
subject Engineering::Electrical and electronic engineering
degree Master of Science (Communications Engineering)
date_accessioned 2021-08-31T05:35:28Z
date_available 2021-08-31T05:35:28Z
date_issued 2021
format Thesis-Master by Coursework
citation Wu, Y. (2021). Clustering word embeddings with different properties for topic modelling. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/152557
url https://hdl.handle.net/10356/152557
language en
file_format application/pdf
publisher Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Electrical and electronic engineering
author2 Lihui Chen
format Thesis-Master by Coursework
author Wu, Yijun
author_sort Wu, Yijun
title Clustering word embeddings with different properties for topic modelling
publisher Nanyang Technological University
publishDate 2021
url https://hdl.handle.net/10356/152557
_version_ 1772828233916481536