Enhancing the performance of selectivity estimation with machine learning techniques

Selectivity estimation is a fundamental task in database management systems and has been studied for decades. Given a SQL query, selectivity estimation returns the number of records in the query result without answering the query. Most real database management systems adopt statistical methods fo...

Full description

Saved in:
Bibliographic Details
Main Author: Meng, Zizhong
Other Authors: Gao Cong
Format: Thesis-Doctor of Philosophy
Language:English
Published: Nanyang Technological University 2024
Subjects:
Online Access:https://hdl.handle.net/10356/174777
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Nanyang Technological University
Language: English
Description
Summary:Selectivity estimation is a fundamental task in database management systems and has been studied for decades. Given a SQL query, selectivity estimation returns the number of records in the query result without answering the query. Most real database management systems adopt statistical methods for selectivity estimation, such as histograms and sampling. These methods have good efficiency, but low accuracy since most of these methods are based on attribute independence assumptions. In recent years, machine learning shows good performance in many fields. Some studies introduce learning methods for selectivity estimation. These estimators outperform statistical methods in accuracy. However, current learning based estimators have some downsides and limitations, which make them not practical to be implemented into database management systems. In this dissertation, we point out limitations of state-of-the-art estimators and develop new methods for selectivity estimation. On the one hand, our methods have better performance than state-of-the-art estimators. On the other hand, our methods could handle more types of data and queries. In the first study, we focus on the dataset with large domain continuous attributes. Deep autoregressive models could achieve high accuracy on selectivity estimation. However, when datasets have continuous attributes with large domain sizes, the estimation accuracy and efficiency will be decreased dramatically. In this study, we propose a novel model IAM, which integrates Gaussian mixture models and a deep autoregressive model. On the one hand, Gaussian mixture models could reduce the domain sizes of continuous attributes to simplify the data distribution. On the other hand, the deep autoregressive model could learn the joint data distribution accurately. We also develop an unbiased progressive sampling algorithm for accurate and fast estimation on IAM. Our experimental results demonstrate that IAM can achieve higher accuracy than state-of-the-art estimators, and has the best query performance while being integrated into a query optimizer. In the second study, we focus on the dataset containing set-valued attributes and queries containing set containment search predicates. For the best of our knowledge, this is the first work which could handle queries involving point, range and set containment search predicates. In this study, we propose a selectivity estimator for dataset with setvalued attributes and queries with subset predicates and superset predicates. We first propose the set-valued column factorization problem, whereby each each set-valued column is converted to multiple numeric subcolumns, and set containment predicates are converted to numeric comparison predicates. This enables us to leverage any existing estimator to perform selectivity estimation. We then develop two methods for efficiently column factorization and query conversion, namely ST and ST-hist, and implement them into three estimators. In the third study, we focus on query workload drift problem on query-driven estimation. Query-driven estimators have poor performance if the distribution of target query workload is different with the distribution of source query workload. In this study, we first propose to take query template distribution as the query workload distribution and detect query workload drift with KL-divergence. We then propose a contrastive learning based estimators for query workload distribution adaptation if the query workload drift is detected, namely CLMSCN. CLMSCN conducts contrastive learning on source query workload and target query workload in the adaptation training stage, which improves the estimation accuracy on the target query workload.