Distributed systems for spatio-textual data streams
Main Author: | |
---|---|
Other Authors: | |
Format: | Theses and Dissertations |
Language: | English |
Published: | 2019 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/106448 http://hdl.handle.net/10220/47970 |
Institution: | Nanyang Technological University |
Summary: | With the spread of social networks and smartphones, huge amounts of data carrying both spatial and textual information, e.g., geo-tagged tweets, are generated continuously and can be modelled as data streams. Such spatio-textual data streams contain valuable information for millions of users with varied interests in different keywords and locations. There is increasing demand for efficiently exploring and processing spatio-textual data streams, which calls for systems that can provide real-time analytical results over spatio-textual data.
Publish/subscribe systems enable efficient and effective information distribution by allowing users to register continuous queries with both spatial and textual constraints. However, most existing publish/subscribe systems are centralized: they run on a single machine that processes all the incoming data. The explosive growth of data scale and user base challenges these centralized publish/subscribe systems for spatio-textual data streams. To overcome these challenges, we propose a distributed publish/subscribe system, called PS2Stream, which digests a massive spatio-textual data stream and directs the stream to target users with registered interests. Compared with existing systems, PS2Stream achieves a better workload distribution, both minimizing the total amount of workload and balancing the load across workers. To achieve this, we propose a new workload distribution algorithm that considers both the spatial and the textual properties of the data. Additionally, PS2Stream supports dynamic load adjustments to adapt to changes in the workload. Extensive empirical evaluation on a commercial cloud computing platform with real data validates the superiority of our system design and the advantages of our techniques for improving system performance.
Publish/subscribe systems provide efficient ways to analyze spatio-textual data at the tuple level, returning in real time the set of spatio-textual objects that satisfy the continuous queries. However, in some scenarios, users are more interested in the higher-level knowledge that can be extracted from the data. For instance, a marketing manager may want to know the popularity of a product in different regions in order to decide whether to adjust the advertising strategy. A data stream warehouse system (DSWS) offers efficient data ingestion and enables online analytical processing (OLAP) over streaming data. Unfortunately, existing DSWSs are not tailored for spatio-textual data, and adapting them requires a significant amount of effort.
We develop a DSWS called STAR (Spatio-Textual Data Stream Warehouse). STAR is a distributed in-memory stream warehouse system that provides low-latency, up-to-date analytical results over a fast-arriving spatio-textual data stream. STAR facilitates the processing of ad-hoc aggregation queries with spatial or textual constraints by implementing a distributed view materialization algorithm. STAR adopts an effective partitioning strategy for a workload composed of object processing, query processing and view maintenance. Additionally, STAR supports dynamic load adjustments, which make it scalable and adaptive. Extensive experiments over real data sets demonstrate the superior performance of STAR over existing systems. |
---|
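
As an illustration of what a continuous query with both spatial and textual constraints might look like, the sketch below matches one incoming object against registered subscriptions. It is not PS2Stream's algorithm: it is a single-machine linear scan, and the class names, the rectangular query region, and the rule that all keywords must appear in the object are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatioTextualObject:
    """One stream tuple, e.g. a geo-tagged tweet (hypothetical schema)."""
    lat: float
    lon: float
    terms: frozenset


@dataclass(frozen=True)
class Subscription:
    """A continuous query: a bounding box plus a set of required keywords."""
    user: str
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float
    keywords: frozenset

    def matches(self, obj):
        # Spatial constraint: the object must fall inside the rectangle.
        inside = (self.min_lat <= obj.lat <= self.max_lat
                  and self.min_lon <= obj.lon <= self.max_lon)
        # Textual constraint (assumed AND semantics): every query keyword
        # must occur among the object's terms.
        return inside and self.keywords <= obj.terms


def route(obj, subscriptions):
    """Return the users whose subscriptions the object satisfies."""
    return [s.user for s in subscriptions if s.matches(obj)]


subs = [
    Subscription("alice", 1.2, 103.6, 1.5, 104.1, frozenset({"coffee"})),
    Subscription("bob", 40.0, -75.0, 41.0, -73.0, frozenset({"coffee"})),
]
tweet = SpatioTextualObject(1.35, 103.68, frozenset({"coffee", "ntu"}))
matched_users = route(tweet, subs)
```

A scan like this touches every subscription for every object, which is the kind of per-tuple cost that a distributed design would spread across workers by partitioning on space, text, or both.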