New indexing methods for different types of spatial data
With the proliferation of GPS devices, spatial data is being generated at an unprecedented speed. Sometimes spatial data come with textual information and temporal information, forming spatio-textual data and trajectory data respectively. We are also seeing significant growths in the interest in the...
Saved in:
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: |
Nanyang Technological University
2024
|
Subjects: | |
Online Access: | https://hdl.handle.net/10356/173587 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | Nanyang Technological University |
Language: | English |
Summary: | With the proliferation of GPS devices, spatial data is being generated at an unprecedented speed. Sometimes spatial data come with textual information and temporal information, forming spatio-textual data and trajectory data respectively. We are also seeing significant growths in the interest in the management and analysis tasks (e.g., query processing) on different types of spatial data, which increasingly rely on good indexes. In this thesis, we explore various methods to build indexes for spatial data, spatio-textual data and trajectory data, in order to facilitate efficient query processing.
First, we look into spatial data indexing. In recent years, learned indexes have been proposed to replace classic index structures like B-Tree with machine learning (ML) models. They require to replace both the indexes and the query processing algorithms currently deployed by the databases, and such a radical departure is likely to encounter challenges and obstacles. In contrast, we propose a fundamentally different way of using ML techniques to build a better R-Tree without the need to change the structure or
query processing algorithms of traditional R-Tree. Specifically, we develop reinforcement learning (RL) based models to decide how to choose a subtree for insertion and how to split a node when building and updating an R-Tree, instead of relying on hand-crafted heuristic rules currently used by the R-Tree and its variants. Experiments on real and synthetic datasets with up to more than 100 million spatial objects show that our RL based index outperforms the R-Tree and its variants in terms of query processing time.
Second, we look into spatio-textual data indexing. We first propose a novel concept of local keyword frequency, which measures keyword frequency of terms in some subspace of the data space, and then leverage on that to build an index, i.e., LKFIR-Tree, through a one-by-one insertion approach. Compared with existing methods, which do not consider different data features in different parts of the data, our proposed method demonstrates superior performance when processing spatial keyword queries. Experiments on three real datasets show that LKDIR-Tree outperforms the baselines by up to 61% when processing boolean range spatial keyword queries.
Third, we look into trajectory data indexing. We propose BT-Tree, which is built through a recursive bi-partitioning approach, for the processing of range and KNN queries for past trajectory data. We first propose a cost function based method (CFBM) to build the BT-Tree. Specifically, we design a novel cost function, which incorporates the characteristics of both the data and historical query workload, to decide how to partition a BT-Tree node. Then we propose a reinforcement learning (RL) based method to address
CFBM’s limitations, such as making locally optimal decisions that may lead to global suboptimality. Experiments on three real datasets with up to 800 million data points show that the CFBM generally outperforms the baselines in terms of query processing time and the RL based method consistently outperforms the baselines and has more significant advantages on larger datasets. |
---|