IMPROVING THE PERFORMANCE OF HDBSCAN ON SHORT TEXT CLUSTERING BY USING WORD EMBEDDINGS AND UMAP
| Main Author: | |
|---|---|
| Format: | Theses |
| Language: | Indonesian |
| Online Access: | https://digilib.itb.ac.id/gdl/view/58051 |
| Institution: | Institut Teknologi Bandung |
Summary: Short text is one of the data formats commonly generated by people on social media, for instance, tweets on Twitter. Such texts are often used as data for analyzing what is trending in a community. However, topic modeling and text clustering algorithms face unique problems on short text: sparsity, caused by many unique words that appear in only a few documents, and a lack of word co-occurrences, which makes it difficult for the system to capture the semantic information of words. To overcome these two problems, we propose a method that uses word embeddings to represent documents in a vector space. The FastText and BERT embedding models are chosen for the quality of their text representations and their ability to handle out-of-vocabulary (OOV) words. For clustering, the HDBSCAN algorithm is used because of its ability to handle noise; however, it performs poorly on high-dimensional data. Because the vectors produced by word embeddings are high-dimensional, dimensionality reduction with UMAP is applied to the vectors before they are fed to HDBSCAN. The experimental results show that our method outperforms the baseline when evaluated with the purity and NMI metrics. The clustering result can also be used as a training feature by a classifier to improve performance on classification tasks.
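The pipeline described in the summary (embed the short texts, reduce the dimensionality with UMAP, then cluster with HDBSCAN, and evaluate with NMI) can be sketched as follows. This is a minimal illustration, not the thesis's actual configuration: it assumes the sentence-transformers, umap-learn, hdbscan, and scikit-learn packages, shows only the BERT-based branch (the FastText branch is omitted), and the model name, toy corpus, and parameter values are assumptions for demonstration.

```python
import umap
import hdbscan
from sentence_transformers import SentenceTransformer
from sklearn.metrics import normalized_mutual_info_score

# Toy corpus of short texts standing in for tweets; two obvious topics.
docs = [
    "new phone battery drains too fast",
    "how to fix battery drain on my phone",
    "tips to make your phone battery last longer",
    "phone battery dies after a few hours",
    "easy pasta recipe for a quick dinner",
    "best weeknight pasta dishes to cook",
    "creamy pasta sauce made in ten minutes",
    "simple pasta dinner ideas for beginners",
]
true_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # assumed ground truth for the toy corpus

# 1) Represent each short text as a dense vector with a BERT-based encoder
#    (assumed model choice; the thesis may use a different BERT variant).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(docs)  # shape: (8, 384)

# 2) Reduce the high-dimensional embeddings with UMAP before clustering.
reducer = umap.UMAP(n_components=2, n_neighbors=3, metric="cosine", random_state=42)
reduced = reducer.fit_transform(embeddings)

# 3) Cluster the reduced vectors with HDBSCAN; the label -1 marks noise points.
clusterer = hdbscan.HDBSCAN(min_cluster_size=2)
labels = clusterer.fit_predict(reduced)

# 4) Evaluate the clustering against the ground-truth labels with NMI.
print("cluster labels:", labels)
print("NMI:", normalized_mutual_info_score(true_labels, labels))
```

In this sketch, HDBSCAN assigns the label -1 to points it treats as noise, which is the noise-handling ability the summary refers to, and UMAP is what makes the clustering step tractable on embeddings that would otherwise be several hundred dimensions wide.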