Ranking user generated content using topic models

With the popularity of Web 2.0, more and more users express and share opinions through various online platforms. Example platforms include news websites that support user commenting like Yahoo! News, social network sites that allow users to post messages like Facebook and Twitter, and community-base...

Bibliographic Details
Main Author: Ma, Zongyang
Other Authors: Sun Aixin
Format: Theses and Dissertations
Language: English
Published: 2015
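Subjects: DRNTU::Engineering::Computer science and engineering::Information systems::Information storage and retrieval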
Online Access: https://hdl.handle.net/10356/65539
Institution: Nanyang Technological University
id sg-ntu-dr.10356-65539
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Information systems::Information storage and retrieval
spellingShingle DRNTU::Engineering::Computer science and engineering::Information systems::Information storage and retrieval
Ma, Zongyang
Ranking user generated content using topic models
description With the popularity of Web 2.0, more and more users express and share opinions through online platforms. Examples include news websites that support user commenting, such as Yahoo! News; social networking sites that let users post messages, such as Facebook and Twitter; and community-based question answering sites, where users ask and answer questions. As a result, a huge amount of User Generated Content (UGC) has accumulated online in the form of comments, tweets, question and answer posts, and more. Depending on the platform on which it is created, UGC may be associated with different types of attributes, such as its creator, time, location, text, and the social connections of its creator. At the same time, UGC from different platforms shares similar characteristics: huge volume, a free writing style, and a heterogeneous nature. More importantly, UGC often exhibits a master-slave relationship. A comment is associated with a news article; a hashtag annotates the tweet in which it is embedded; an answer does not exist without a question. Here, news articles, tweets, and questions are master documents, while comments, hashtags, and answers are slave documents. Although topic modeling (e.g., LDA and PLSA) has been widely used to model text collections, discovering fine-grained topics from UGC while taking the master-slave relationship into account remains an open and challenging problem.
In this research, the generative process of UGC data is simulated using topic models in order to rank the slave documents of a given master document, with the aim of reducing information overload. Depending on the platform on which the UGC is created, three sub-problems are defined and addressed: (i) comment ranking for news articles, (ii) hashtag ranking for tweets, and (iii) answer ranking for questions.
Comment ranking is essential for identifying the important comments that summarize the user discussion of a news article. In this task, we assume that the topics of slave documents cover the topics of their corresponding master document as well as topics discussed solely in comments. For this problem, we propose two LDA-style topic models: the Master-Slave Topic Model (MSTM) and the Extended Master-Slave Topic Model (EXTM). MSTM constrains the topics discussed in comments to be derived from the news article being commented on. EXTM allows the words of comments to be generated from both the topics derived from the news article being commented on and the topics derived from the comments themselves. Evaluated on Yahoo! News, the proposed models outperform baseline methods.
Hashtag ranking is important for tweet annotation and retrieval. Here, we assume that the topics of slave documents are a topical summary of their corresponding master documents. For this problem, we propose two PLSA-style topic models of hashtag annotation behavior. The Content-Pivoted Model (CPM) assumes that tweet content guides the generation of hashtags, while the Hashtag-Pivoted Model (HPM) assumes that hashtags guide the generation of tweet content. The experimental results show that CPM is the most effective at ranking the most relevant hashtags for tweets.
Answer ranking helps users easily pick out the best answers to questions. In this task, we assume that the topics of slave documents and the topics of their corresponding master documents are similar, but that the words of slave topics and master topics are drawn from different vocabularies. For this problem, we propose a PLSA-style topic model, the Tri-Role Topic Model (TRTM), to model the three roles of users (asker, answerer, and voter) and the activities of each role, including composing questions, selecting questions to answer, and contributing and voting on answers. Evaluated on Stack Overflow data, TRTM outperforms state-of-the-art methods for ranking high-quality answers to given questions.
All three problems concern ranking UGC from different platforms using topic models, and the proposed topic models are extended according to the master-slave structure of the UGC. For comment ranking, the slave documents (comments) are much shorter than their corresponding master document (news article). Our main concern is discovering topics from comments that reflect the topics of their news article while retaining topics discussed only among comments. For hashtag ranking, the slave documents (hashtags) are extremely short, and sometimes a hashtag is just an abbreviation of one or a few words. Hashtag ranking is therefore more difficult than comment ranking, and we introduce more factors (e.g., user and time) to enrich the hashtag representation. Lastly, for answer ranking, an answer has an important feature: its votes. Modeling the voting behavior of users in a generative model is challenging. To address this task, we focus more on modeling the relationships between questions, answers, askers, and answerers using the exponential KL-divergence function.
In this research, we define three ranking problems for User Generated Content. To address these problems, we propose several extended topic models that fit the characteristics and structure of UGC data from different platforms. From Yahoo! News to Twitter and then to Stack Overflow, the features of the data used in our research become increasingly complex. The designed topic models accordingly incorporate more features and relationships to simulate the generation process of UGC data more accurately. Experimental results show that our methods outperform baseline methods for all three problems.
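The general idea of ranking slave documents against a master document by topical closeness can be illustrated with a small sketch. This is not the thesis's model: it assumes that per-document topic distributions have already been inferred by some topic model, and it assumes the "exponential KL-divergence function" takes the form exp(-KL(master || slave)), which the abstract does not specify. The function names and the example distributions are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics.

    eps guards against division by zero; terms with p_i == 0 contribute 0.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def exp_kl_similarity(p, q):
    """Assumed similarity: exp(-KL(p || q)), in (0, 1], equal to 1 when p == q."""
    return math.exp(-kl_divergence(p, q))


def rank_slaves(master_topics, slave_topics):
    """Sort slave documents by decreasing similarity to the master document.

    master_topics: topic distribution of the master document (list of floats)
    slave_topics:  dict mapping a slave document id to its topic distribution
    """
    scored = [
        (doc_id, exp_kl_similarity(master_topics, dist))
        for doc_id, dist in slave_topics.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical three-topic distributions for one article and three comments.
    article = [0.7, 0.2, 0.1]
    comments = {
        "comment_a": [0.7, 0.2, 0.1],
        "comment_b": [0.1, 0.2, 0.7],
        "comment_c": [0.5, 0.3, 0.2],
    }
    for doc_id, score in rank_slaves(article, comments):
        print(f"{doc_id}: {score:.3f}")
```

In this toy setting the comment whose topic mixture matches the article ranks first and the one concentrated on a different topic ranks last. The models described in the abstract go further by learning master and slave topics jointly and by adding user, time, and voting information.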
author2 Sun Aixin
author_facet Sun Aixin
Ma, Zongyang
format Theses and Dissertations
author Ma, Zongyang
author_sort Ma, Zongyang
title Ranking user generated content using topic models
title_short Ranking user generated content using topic models
title_full Ranking user generated content using topic models
title_fullStr Ranking user generated content using topic models
title_full_unstemmed Ranking user generated content using topic models
title_sort ranking user generated content using topic models
publishDate 2015
url https://hdl.handle.net/10356/65539
_version_ 1759853424793354240
spelling sg-ntu-dr.10356-65539 2023-03-04T00:46:46Z Ranking user generated content using topic models Ma, Zongyang Sun Aixin School of Computer Engineering Centre for Advanced Information Systems DRNTU::Engineering::Computer science and engineering::Information systems::Information storage and retrieval DOCTOR OF PHILOSOPHY (SCE) 2015-10-22T07:28:51Z 2015-10-22T07:28:51Z 2015 2015 Thesis Ma, Z. (2015). Ranking user generated content using topic models. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/65539 10.32657/10356/65539 en 147 p. application/pdf