Enhancing the performance of selectivity estimation with machine learning techniques
Selectivity estimation is a fundamental task in database management systems and has been studied for decades. Given a SQL query, selectivity estimation returns the number of records in the query result without answering the query. Most real database management systems adopt statistical methods fo...
Saved in:
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: |
Nanyang Technological University
2024
|
Subjects: | |
Online Access: | https://hdl.handle.net/10356/174777 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | Nanyang Technological University |
Language: | English |
Summary: | Selectivity estimation is a fundamental task in database management systems and has
been studied for decades. Given a SQL query, selectivity estimation returns the number of
records in the query result without answering the query. Most real database management
systems adopt statistical methods for selectivity estimation, such as histograms and sampling.
These methods have good efficiency, but low accuracy since most of these methods
are based on attribute independence assumptions. In recent years, machine learning shows
good performance in many fields. Some studies introduce learning methods for selectivity
estimation. These estimators outperform statistical methods in accuracy. However, current
learning based estimators have some downsides and limitations, which make them
not practical to be implemented into database management systems. In this dissertation,
we point out limitations of state-of-the-art estimators and develop new methods
for selectivity estimation. On the one hand, our methods have better performance than
state-of-the-art estimators. On the other hand, our methods could handle more types of
data and queries.
In the first study, we focus on the dataset with large domain continuous attributes.
Deep autoregressive models could achieve high accuracy on selectivity estimation. However,
when datasets have continuous attributes with large domain sizes, the estimation
accuracy and efficiency will be decreased dramatically. In this study, we propose a novel
model IAM, which integrates Gaussian mixture models and a deep autoregressive model.
On the one hand, Gaussian mixture models could reduce the domain sizes of continuous
attributes to simplify the data distribution. On the other hand, the deep autoregressive
model could learn the joint data distribution accurately. We also develop an unbiased
progressive sampling algorithm for accurate and fast estimation on IAM. Our experimental
results demonstrate that IAM can achieve higher accuracy than state-of-the-art
estimators, and has the best query performance while being integrated into a query optimizer.
In the second study, we focus on the dataset containing set-valued attributes and
queries containing set containment search predicates. For the best of our knowledge, this
is the first work which could handle queries involving point, range and set containment
search predicates. In this study, we propose a selectivity estimator for dataset with setvalued
attributes and queries with subset predicates and superset predicates. We first propose the set-valued column factorization problem, whereby each each set-valued column
is converted to multiple numeric subcolumns, and set containment predicates are
converted to numeric comparison predicates. This enables us to leverage any existing
estimator to perform selectivity estimation. We then develop two methods for efficiently
column factorization and query conversion, namely ST and ST-hist, and implement them
into three estimators.
In the third study, we focus on query workload drift problem on query-driven estimation.
Query-driven estimators have poor performance if the distribution of target query
workload is different with the distribution of source query workload. In this study, we
first propose to take query template distribution as the query workload distribution and
detect query workload drift with KL-divergence. We then propose a contrastive learning
based estimators for query workload distribution adaptation if the query workload
drift is detected, namely CLMSCN. CLMSCN conducts contrastive learning on source query
workload and target query workload in the adaptation training stage, which improves
the estimation accuracy on the target query workload. |
---|