A study on dual topic models and Bayesian topic model inference
Main Author:
Other Authors:
Format: Thesis (Doctor of Philosophy)
Language: English
Published: Nanyang Technological University, 2020
Subjects:
Online Access: https://hdl.handle.net/10356/145117
Institution: Nanyang Technological University
Summary: Topic models have been shown to be a powerful tool for organizing large collections of unstructured data, including text, images, and location data. The essence of the topic model is to discover hidden clusters that summarize the relationship between two discrete random variables, such as documents and words in text data, or users and places in location data. In the former case, a topic typically refers to a cluster of semantically related words, and the documents are described using the extracted topics.
Existing work on unsupervised clustering analysis primarily focuses on grouping a single type of random variable. However, applications in microarray and location data have shown that clustering two kinds of random variables simultaneously can be beneficial. This process is known as biclustering. Previous approaches to biclustering typically rely on a hard assignment of the data to the extracted clusters, known as hard clustering. In contrast, in this thesis we study how topic models can be leveraged to provide a novel soft-clustering strategy for biclustering problems. One key benefit of soft assignments over hard assignments is that the former can identify overlapping biclusters, an inherent characteristic of many real-world applications. Moreover, learning topic models is recognized as one of the most fundamental tasks in this area, so we also focus on developing novel inference algorithms that extract topics with improved computational efficiency and predictive power. The sequential nature of topic model inference does not allow a straightforward parallel implementation: existing distributed implementations approximate the results of single-threaded algorithms and may lower the quality of the extracted topics. Thus, we also study how to implement our inference algorithm on multiple cores without affecting the quality of hidden topic detection.
In particular, the first part of the thesis treats LDA as a soft-clustering algorithm and introduces the dual LDA model, which can be used to obtain complementary clusters on the same dataset. For example, location data contains the places visited by a set of users. On this dataset, LDA assigns the users to clusters, and the clusters represent probability distributions over the set of locations. In contrast, the dual LDA assigns the locations to clusters, and the clusters represent probability distributions over the set of users. LDA and the dual LDA thus cluster one dataset from two different perspectives. We then propose to combine the two LDA models into a single model that summarizes the given dataset using soft assignments to biclusters. We demonstrate the applicability of the novel model on text data as well as microarray data. In our study, we show that our model reveals a greater number of high-quality biclusters from microarray data than established biclustering algorithms; this result also holds for synthetic datasets. In addition, we show that the proposed model can improve other topic models in a reviewer recommendation application.
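The combined dual model is the thesis's own construction, but the two complementary perspectives can be loosely illustrated (an assumption for illustration only) by running a standard LDA on a user-by-location count matrix and again on its transpose:

```python
# Rough illustration (NOT the thesis's dual-LDA model): the same count
# matrix is clustered from two perspectives by fitting standard LDA on
# X (users described by locations) and on X.T (locations described by users).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Toy user-by-location visit counts with two blocks of co-visiting users.
X = np.zeros((6, 8), dtype=int)
X[:3, :4] = rng.integers(1, 5, size=(3, 4))  # users 0-2 visit places 0-3
X[3:, 4:] = rng.integers(1, 5, size=(3, 4))  # users 3-5 visit places 4-7

lda = LatentDirichletAllocation(n_components=2, random_state=0)
user_topics = lda.fit_transform(X)    # clusters users; topics are dists over locations

dual = LatentDirichletAllocation(n_components=2, random_state=0)
loc_topics = dual.fit_transform(X.T)  # clusters locations; topics are dists over users
```

A joint model as proposed in the thesis would tie these two views together so that overlapping biclusters of users and locations emerge from a single soft assignment.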
In the second part of the thesis, we shift our attention to the learning of topic models. A popular inference approach for topic models is collapsed Gibbs sampling (CGS), which typically samples a single latent topic label for each observed document-word pair. First, we extend CGS and propose to assign a compound distribution over the possible assignments for a given document-word pair. The new approach, called recursively compound allocation (RCA), leverages the benefits of soft topic assignments and achieves improved generalization performance. Then, we study the relationship between RCA and state-augmentation methods and derive a new generic deterministic inference method for learning a family of probabilistic topic models. One key benefit of the proposed method lies in its deterministic nature, which greatly improves both running efficiency and predictive perplexity.
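For context, the baseline that RCA extends can be sketched as a textbook collapsed Gibbs sampler for LDA (this is the standard CGS, not the thesis's RCA method): each token carries a single sampled topic label, and the sampler resamples it from the collapsed conditional.

```python
# Textbook collapsed Gibbs sampler for LDA (baseline sketch, not RCA):
# each document-word token gets a single sampled topic label z, and
# count matrices are updated incrementally as labels are resampled.
import numpy as np

def cgs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))          # document-topic counts
    nkw = np.zeros((K, V))          # topic-word counts
    nk = np.zeros(K)                # tokens per topic
    z = []                          # one topic label per token
    for d, doc in enumerate(docs):  # random initialization
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]         # remove the token's current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # collapsed conditional p(z = k | everything else)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k         # add the token back with its new label
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw

# Usage on a toy corpus of word-id lists (vocabulary of 8 words, 2 topics).
docs = [[0, 0, 1, 2], [1, 2, 2, 3], [4, 5, 5, 6], [5, 6, 6, 7]]
ndk, nkw = cgs_lda(docs, V=8, K=2)
```

The single hard label per token is exactly what RCA replaces with a compound distribution over possible assignments, and the token-by-token sequential updates are what make parallelizing CGS nontrivial.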
We use well-known metrics to show that our proposed inference methods extract topics of higher quality. In addition, our inference algorithm shows higher predictive power on real-world datasets, as evidenced by an evaluation of predictive perplexity using four topic models, two text datasets, and a movie dataset. One of the evaluated topic models corresponds to our proposed topic model for biclustering. We also evaluate the impact on predictive perplexity, as well as on the speed of the algorithm, as we increase the number of cores. In both scenarios, we observe a clear improvement of the proposed inference algorithm over the state-of-the-art algorithm. Finally, we demonstrate that the increase in predictive power translates to an improvement in a real-world application, namely the popular document classification task.
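The evaluation protocol mentioned above (held-out predictive perplexity) can be sketched generically with scikit-learn; this is ordinary library usage on synthetic counts, not the thesis's evaluation code, and lower perplexity indicates higher predictive power.

```python
# Generic held-out perplexity evaluation sketch (not the thesis's code):
# fit a topic model on training counts, then score unseen documents.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(100, 50))  # toy document-word counts
X_test = rng.integers(0, 5, size=(20, 50))    # held-out documents

lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X_train)
ppl = lda.perplexity(X_test)  # lower is better
print("held-out perplexity:", ppl)
```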