Query processing in publish/subscribe systems for textual data streams

Bibliographic Details
Main Author: Chen, Lisi
Other Authors: Dr Gao Cong
Format: Theses and Dissertations
Language: English
Published: 2016
Online Access: http://hdl.handle.net/10356/66232
Institution: Nanyang Technological University
Description
Summary: With the rapid development of online social media (e.g., Facebook and Flickr) and micro-blogging services (e.g., Twitter, Tumblr, and Weibo), huge amounts of streaming text data are being generated at an unprecedented scale. Such data is particularly well suited to information dissemination. The demand for disseminating interesting information from data streams to users gives prominence to content-based publish/subscribe systems, in which users personalize their requirements by issuing subscription queries and are notified when items matching those requirements are captured from the data stream. Although content-based publish/subscribe systems have been applied successfully in many real-world applications, existing work on them has two limitations. First, existing content-based publish/subscribe systems usually do not consider location. With the deployment and use of GPS-enabled devices, spatial (or geographical) documents are emerging in which content is associated with locations (e.g., Points of Interest on Google Maps, check-ins on Foursquare, and geo-tagged tweets on Twitter). As a result, users may want to issue subscription queries with both keyword and location requirements. For instance, a user who subscribes to promotional information about seafood restaurants may be interested only in information posted by nearby seafood restaurants. Second, existing publish/subscribe systems do not consider query result diversification, which has drawn considerable attention as a way to increase user satisfaction in web search. To overcome the first limitation, we conduct the first study on location-aware publish/subscribe for textual data streams.
Specifically, we propose a new type of subscription query for publish/subscribe systems, the Boolean Range Continuous (BRC) query, which continuously finds, over a data stream, spatio-temporal documents whose locations fall within the query region and whose text satisfies the query's Boolean predicates. We develop an efficient system for processing such queries. To improve the quality of the results returned by each subscription query, we propose a new type of location-based subscription query, the Temporal Spatial-Keyword Top-k Subscription (TaSK) query, which ranks spatio-temporal documents and continuously maintains the top-ranked documents based on a score that considers three aspects: (1) text relevance; (2) spatial proximity; and (3) document recency. We develop an efficient approach to maintaining up-to-date top-k results for a large number of TaSK queries over a stream of spatio-temporal documents. To address the second limitation, we develop the first diversity-aware publish/subscribe system over a text stream. Specifically, we propose the Diversity-Aware Top-k Subscription (DAS) query, which takes into account text relevance, document recency, and result diversity when matching a new document. We propose an efficient mechanism that, for each DAS query, continuously maintains an up-to-date result set containing the k most recently returned documents over a text stream.
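
The BRC matching condition described in the summary (location inside the query region, text satisfying Boolean predicates) can be illustrated with a small sketch. This is not the system developed in the thesis: it is a naive linear scan over subscriptions, and several details are assumptions made only for illustration, namely the rectangular region, the planar coordinates, the representation of the Boolean predicate as an AND of OR-clauses over keywords, and the hypothetical names `Document`, `BRCQuery`, and `publish`.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Document:
    """A spatio-temporal document arriving on the stream (hypothetical type)."""
    doc_id: str
    terms: FrozenSet[str]  # tokenized text of the document
    x: float               # planar coordinates, assumed for simplicity
    y: float
    timestamp: float


@dataclass(frozen=True)
class BRCQuery:
    """A subscription with a region and a Boolean keyword predicate (hypothetical type)."""
    query_id: str
    # Rectangle as (min_x, min_y, max_x, max_y); the rectangular shape is an assumption.
    region: Tuple[float, float, float, float]
    # Assumed predicate form: every clause must share at least one term with the document.
    clauses: Tuple[FrozenSet[str], ...]

    def matches(self, doc: Document) -> bool:
        min_x, min_y, max_x, max_y = self.region
        in_region = min_x <= doc.x <= max_x and min_y <= doc.y <= max_y
        # A non-empty intersection means the OR-clause is satisfied.
        return in_region and all(clause & doc.terms for clause in self.clauses)


def publish(doc: Document, queries: Iterable[BRCQuery]) -> List[str]:
    """Return the ids of all subscriptions to notify about a new document (naive scan)."""
    return [q.query_id for q in queries if q.matches(doc)]


queries = [
    BRCQuery("q1", (0.0, 0.0, 10.0, 10.0),
             (frozenset({"seafood"}), frozenset({"promotion", "discount"}))),
    BRCQuery("q2", (20.0, 20.0, 30.0, 30.0), (frozenset({"seafood"}),)),
]
doc = Document("d1", frozenset({"seafood", "discount", "today"}), 3.0, 4.0, 1.0)
print(publish(doc, queries))  # ['q1']
```

A scan like this costs time proportional to the number of subscriptions for every incoming document, which is the cost an efficient publish/subscribe system would aim to avoid by indexing the subscriptions.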