Effective and efficient topic mining and exploration from geo-textual data
Main Author: | |
---|---|
Other Authors: | |
Format: | Theses and Dissertations |
Language: | English |
Published: | 2018 |
Subjects: | |
Online Access: | http://hdl.handle.net/10356/73632 |
Institution: | Nanyang Technological University |
Summary: | With the prevalence of online social media (e.g., Facebook, Twitter), location-based services (e.g., Foursquare, Yelp, Flickr), and GPS-enabled devices, a huge number of documents with spatial information are being generated. Such documents are associated with either points of interest (e.g., restaurants) or latitude-longitude coordinates. We call these documents geo-textual documents. Geo-textual documents often contain information that indicates public or individual views and interests. It is of great interest to mine and explore topics from geo-textual documents to support various practical tasks, e.g., business analytics, point-of-interest (POI) recommendation, user recommendation, and topic exploration. There are two types of studies on mining topics from geo-textual data: (1) discovering topics of individuals from POI-associated posts (e.g., check-ins); and (2) mining and exploring topics of regions from geo-tagged microblogs. However, both types of studies have several limitations. Firstly, the topics of individuals mined from geo-textual data have been successfully applied to POI recommendation, location prediction, etc. However, most existing methods mine topics from Foursquare check-in datasets. Because each check-in often contains limited textual information, and most users share only a few check-ins, it is difficult to discover meaningful topics of individuals from check-in data. Moreover, the existing methods cannot capture topical aspects, e.g., the “environment” of a restaurant, and thus fail to tell users why a POI is recommended. Worse still, the existing methods are frequency-based (the more often a user mentions a topic, the more likely the user is assumed to prefer it) and ignore the user’s sentiment. A user may hold negative opinions on some topics even though he/she mentions them many times.
Secondly, the existing studies on learning topics of regions only allow users to explore topics in predefined regions and time spans, whereas a user may want to query topics within an arbitrary, user-specified region and time span. For example, a social scientist may want to discover breaking events by submitting regions and time spans in an exploratory manner. Some studies propose to learn geographical topic models to uncover latent regions and geographical topics. However, training these models is time-consuming: it often takes months to train a model of moderate size (e.g., thousands of topics and thousands of regions) on millions of documents, and no distributed solution exists for training geographical topic models. To overcome the limitations in mining topics of individuals, we address two research challenges. First, we propose an approach to associating POIs with geo-tagged microblogs to compose a complementary “check-in” data source for topic mining of individuals. Second, we propose a unified model for learning topical aspects and regions of individuals with consideration of sentiment. The proposed model can improve the effectiveness of many downstream applications, e.g., POI recommendation, user recommendation, and aspect satisfaction analysis. To overcome the limitations in mining topics of regions, we consider two research problems. First, we develop a framework for exploring topics within a user-specified region and time span. The framework can return the topics that fall within the spatio-temporal query to a user within seconds. Second, to allow efficient training of geographical topic models, we propose a distributed solution that supports learning large geographical topic models with millions of parameters from tens of gigabytes of geo-textual documents within 20 hours on a small cluster of 20 machines. |
---|