IMPROVING THE PERFORMANCE OF HDBSCAN ON SHORT TEXT CLUSTERING BY USING WORD EMBEDDINGS AND UMAP

Bibliographic Details
Main Author: Sidik Asyaky, Muhammad
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/58051
Institution: Institut Teknologi Bandung
Description
Summary: Short text is one of the data formats commonly generated on social media, for instance tweets on Twitter. Such texts are often used to analyze what is trending in a community. However, topic modeling and text clustering algorithms face two problems specific to short text: sparsity, caused by many unique words that appear in only a few documents, and a lack of word co-occurrences, which makes it difficult for the system to capture the semantic information of words. To overcome these two problems, we propose a novel method that uses word embeddings to represent documents in vector space. The FastText and BERT embedding models are chosen for the quality of their text representations and their ability to handle out-of-vocabulary (OOV) words. For clustering, the HDBSCAN algorithm is used because of its ability to handle noise. However, it performs poorly on high-dimensional data. Because the vectors produced by word embedding are high-dimensional, their dimensionality is reduced with UMAP before they are fed to HDBSCAN. The experimental results show that the proposed method outperforms the baseline on the purity and NMI metrics. The clustering result can also be used as a training feature for a classifier to improve performance on classification tasks.
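The two evaluation metrics named in the summary, purity and NMI, can be illustrated with a minimal standard-library Python sketch. This is not the thesis's evaluation code. The function names, the toy labels, the choice to treat the HDBSCAN noise label -1 as an ordinary cluster, and the normalization of NMI by the arithmetic mean of the two entropies are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def purity(true_labels, cluster_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    per_cluster = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        per_cluster.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((k / n) * log(k / n) for k in Counter(labels).values())


def nmi(true_labels, cluster_labels):
    """Mutual information divided by the mean of the two label entropies.

    Assumption: arithmetic-mean normalization; other normalizations exist.
    """
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    true_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_labels)
    mi = sum(
        (n_ij / n) * log(n_ij * n / (true_counts[t] * cluster_counts[c]))
        for (t, c), n_ij in joint.items()
    )
    mean_entropy = (entropy(true_labels) + entropy(cluster_labels)) / 2
    # Both labelings constant: they agree trivially.
    return mi / mean_entropy if mean_entropy > 0 else 1.0


# Hypothetical toy data: six short texts, with -1 marking an HDBSCAN noise point.
truth = ["sports", "sports", "sports", "politics", "politics", "tech"]
clusters = [0, 0, 1, 1, 1, -1]

print(f"purity = {purity(truth, clusters):.3f}")
print(f"NMI    = {nmi(truth, clusters):.3f}")
```

Whether noise points are excluded, counted as errors, or kept as a separate cluster changes both scores, so a comparison against a baseline should apply the same convention to every method.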