Enhancing the performance of selectivity estimation with machine learning techniques

Selectivity estimation is a fundamental task in database management systems and has been studied for decades. Given a SQL query, selectivity estimation returns the number of records in the query result without answering the query. Most real database management systems adopt statistical methods fo...

Full description

Saved in:
Bibliographic Details
Main Author: Meng, Zizhong
Other Authors: Gao Cong
Format: Thesis-Doctor of Philosophy
Language:English
Published: Nanyang Technological University 2024
Subjects:
Online Access:https://hdl.handle.net/10356/174777
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-174777
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Computer and Information Science
spellingShingle Computer and Information Science
Meng, Zizhong
Enhancing the performance of selectivity estimation with machine learning techniques
description Selectivity estimation is a fundamental task in database management systems and has been studied for decades. Given a SQL query, selectivity estimation returns the number of records in the query result without answering the query. Most real database management systems adopt statistical methods for selectivity estimation, such as histograms and sampling. These methods have good efficiency, but low accuracy since most of these methods are based on attribute independence assumptions. In recent years, machine learning shows good performance in many fields. Some studies introduce learning methods for selectivity estimation. These estimators outperform statistical methods in accuracy. However, current learning based estimators have some downsides and limitations, which make them not practical to be implemented into database management systems. In this dissertation, we point out limitations of state-of-the-art estimators and develop new methods for selectivity estimation. On the one hand, our methods have better performance than state-of-the-art estimators. On the other hand, our methods could handle more types of data and queries. In the first study, we focus on the dataset with large domain continuous attributes. Deep autoregressive models could achieve high accuracy on selectivity estimation. However, when datasets have continuous attributes with large domain sizes, the estimation accuracy and efficiency will be decreased dramatically. In this study, we propose a novel model IAM, which integrates Gaussian mixture models and a deep autoregressive model. On the one hand, Gaussian mixture models could reduce the domain sizes of continuous attributes to simplify the data distribution. On the other hand, the deep autoregressive model could learn the joint data distribution accurately. We also develop an unbiased progressive sampling algorithm for accurate and fast estimation on IAM. Our experimental results demonstrate that IAM can achieve higher accuracy than state-of-the-art estimators, and has the best query performance while being integrated into a query optimizer. In the second study, we focus on the dataset containing set-valued attributes and queries containing set containment search predicates. For the best of our knowledge, this is the first work which could handle queries involving point, range and set containment search predicates. In this study, we propose a selectivity estimator for dataset with setvalued attributes and queries with subset predicates and superset predicates. We first propose the set-valued column factorization problem, whereby each each set-valued column is converted to multiple numeric subcolumns, and set containment predicates are converted to numeric comparison predicates. This enables us to leverage any existing estimator to perform selectivity estimation. We then develop two methods for efficiently column factorization and query conversion, namely ST and ST-hist, and implement them into three estimators. In the third study, we focus on query workload drift problem on query-driven estimation. Query-driven estimators have poor performance if the distribution of target query workload is different with the distribution of source query workload. In this study, we first propose to take query template distribution as the query workload distribution and detect query workload drift with KL-divergence. We then propose a contrastive learning based estimators for query workload distribution adaptation if the query workload drift is detected, namely CLMSCN. CLMSCN conducts contrastive learning on source query workload and target query workload in the adaptation training stage, which improves the estimation accuracy on the target query workload.
author2 Gao Cong
author_facet Gao Cong
Meng, Zizhong
format Thesis-Doctor of Philosophy
author Meng, Zizhong
author_sort Meng, Zizhong
title Enhancing the performance of selectivity estimation with machine learning techniques
title_short Enhancing the performance of selectivity estimation with machine learning techniques
title_full Enhancing the performance of selectivity estimation with machine learning techniques
title_fullStr Enhancing the performance of selectivity estimation with machine learning techniques
title_full_unstemmed Enhancing the performance of selectivity estimation with machine learning techniques
title_sort enhancing the performance of selectivity estimation with machine learning techniques
publisher Nanyang Technological University
publishDate 2024
url https://hdl.handle.net/10356/174777
_version_ 1806059875132768256
spelling sg-ntu-dr.10356-1747772024-05-03T02:58:53Z Enhancing the performance of selectivity estimation with machine learning techniques Meng, Zizhong Gao Cong School of Computer Science and Engineering gaocong@ntu.edu.sg Computer and Information Science Selectivity estimation is a fundamental task in database management systems and has been studied for decades. Given a SQL query, selectivity estimation returns the number of records in the query result without answering the query. Most real database management systems adopt statistical methods for selectivity estimation, such as histograms and sampling. These methods have good efficiency, but low accuracy since most of these methods are based on attribute independence assumptions. In recent years, machine learning shows good performance in many fields. Some studies introduce learning methods for selectivity estimation. These estimators outperform statistical methods in accuracy. However, current learning based estimators have some downsides and limitations, which make them not practical to be implemented into database management systems. In this dissertation, we point out limitations of state-of-the-art estimators and develop new methods for selectivity estimation. On the one hand, our methods have better performance than state-of-the-art estimators. On the other hand, our methods could handle more types of data and queries. In the first study, we focus on the dataset with large domain continuous attributes. Deep autoregressive models could achieve high accuracy on selectivity estimation. However, when datasets have continuous attributes with large domain sizes, the estimation accuracy and efficiency will be decreased dramatically. In this study, we propose a novel model IAM, which integrates Gaussian mixture models and a deep autoregressive model. On the one hand, Gaussian mixture models could reduce the domain sizes of continuous attributes to simplify the data distribution. On the other hand, the deep autoregressive model could learn the joint data distribution accurately. We also develop an unbiased progressive sampling algorithm for accurate and fast estimation on IAM. Our experimental results demonstrate that IAM can achieve higher accuracy than state-of-the-art estimators, and has the best query performance while being integrated into a query optimizer. In the second study, we focus on the dataset containing set-valued attributes and queries containing set containment search predicates. For the best of our knowledge, this is the first work which could handle queries involving point, range and set containment search predicates. In this study, we propose a selectivity estimator for dataset with setvalued attributes and queries with subset predicates and superset predicates. We first propose the set-valued column factorization problem, whereby each each set-valued column is converted to multiple numeric subcolumns, and set containment predicates are converted to numeric comparison predicates. This enables us to leverage any existing estimator to perform selectivity estimation. We then develop two methods for efficiently column factorization and query conversion, namely ST and ST-hist, and implement them into three estimators. In the third study, we focus on query workload drift problem on query-driven estimation. Query-driven estimators have poor performance if the distribution of target query workload is different with the distribution of source query workload. In this study, we first propose to take query template distribution as the query workload distribution and detect query workload drift with KL-divergence. We then propose a contrastive learning based estimators for query workload distribution adaptation if the query workload drift is detected, namely CLMSCN. CLMSCN conducts contrastive learning on source query workload and target query workload in the adaptation training stage, which improves the estimation accuracy on the target query workload. Doctor of Philosophy 2024-04-11T00:19:18Z 2024-04-11T00:19:18Z 2024 Thesis-Doctor of Philosophy Meng, Z. (2024). Enhancing the performance of selectivity estimation with machine learning techniques. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/174777 https://hdl.handle.net/10356/174777 10.32657/10356/174777 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University