Clustering word embeddings with different properties for topic modelling
Main Author:
Other Authors:
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University, 2021
Subjects:
Online Access: https://hdl.handle.net/10356/152557
Institution: Nanyang Technological University
Summary: The goal of topic detection, or topic modelling, is to uncover the hidden topics in a large corpus. It is an increasingly useful analysis tool in the information age. As research has progressed, topic modelling methods have gradually expanded from probabilistic methods to distributed representations. The emergence of BERT marked a major change in the NLP paradigm: this deep model, pre-trained on unlabelled datasets, significantly improved accuracy on NLP tasks, and topic modelling methods were further extended to build on pre-trained word embeddings. This success has inspired many more BERT-based models. This project addresses topic modelling with machine learning techniques. Pre-trained or fine-tuned embeddings from BERT, SBERT, ERNIE 2.0, and SimCSE are combined with weighted K-means, where document statistics serve as the weighting factor, to group or re-rank top words and identify the top 20 topics in the open-source 20 Newsgroups dataset. The results show that the SBERT and SimCSE_sup models outperform the others. In addition, the properties of the different models' embeddings and their impact on topic identification are discussed.
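A minimal sketch of the pipeline the abstract describes, assuming SBERT embeddings from the sentence-transformers checkpoint `all-MiniLM-L6-v2` (the abstract does not name a specific checkpoint), corpus-level word frequency as the document-statistics weight, and scikit-learn's `KMeans` with `sample_weight` for the weighted clustering; the vocabulary size and the centroid-distance ranking of top words are also illustrative choices, not the thesis's exact method:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

# Load the 20 Newsgroups corpus and build a vocabulary with corpus-level counts.
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data
vectorizer = CountVectorizer(stop_words="english", max_features=2000)
counts = vectorizer.fit_transform(docs)
vocab = vectorizer.get_feature_names_out()
word_freq = np.asarray(counts.sum(axis=0)).ravel()  # document statistics (assumed weighting)

# Embed each vocabulary word with a pre-trained SBERT model (assumed checkpoint).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(list(vocab), show_progress_bar=False)

# Weighted K-means: frequent words pull the centroids more strongly.
km = KMeans(n_clusters=20, n_init=10, random_state=0)
km.fit(embeddings, sample_weight=word_freq)

# Rank words within each cluster by distance to the centroid and print
# the closest ones as that topic's top words.
for k in range(20):
    members = np.where(km.labels_ == k)[0]
    dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[k], axis=1)
    top = members[np.argsort(dists)[:10]]
    print(f"Topic {k:2d}:", ", ".join(vocab[top]))
```

Swapping the encoder for BERT, ERNIE 2.0, or SimCSE embeddings, as the thesis compares, changes only the embedding step; the weighted clustering and top-word ranking stay the same.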