Enhancing the performance of selectivity estimation with machine learning techniques

Selectivity estimation is a fundamental task in database management systems and has been studied for decades. Given a SQL query, selectivity estimation returns the number of records in the query result without answering the query. Most real database management systems adopt statistical methods fo...

Full description

Saved in:

Bibliographic Details
Main Author:	Meng, Zizhong
Other Authors:	Gao Cong
Format:	Thesis-Doctor of Philosophy
Language:	English
Published:	Nanyang Technological University 2024
Subjects:	Computer and Information Science
Online Access:	https://hdl.handle.net/10356/174777
Tags:	Add Tag No Tags, Be the first to tag this record!
Institution:	Nanyang Technological University
Language:	English

id	sg-ntu-dr.10356-174777
record_format	dspace
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Computer and Information Science
spellingShingle	Computer and Information Science Meng, Zizhong Enhancing the performance of selectivity estimation with machine learning techniques
description	Selectivity estimation is a fundamental task in database management systems and has been studied for decades. Given a SQL query, selectivity estimation returns the number of records in the query result without answering the query. Most real database management systems adopt statistical methods for selectivity estimation, such as histograms and sampling. These methods have good efficiency, but low accuracy since most of these methods are based on attribute independence assumptions. In recent years, machine learning shows good performance in many fields. Some studies introduce learning methods for selectivity estimation. These estimators outperform statistical methods in accuracy. However, current learning based estimators have some downsides and limitations, which make them not practical to be implemented into database management systems. In this dissertation, we point out limitations of state-of-the-art estimators and develop new methods for selectivity estimation. On the one hand, our methods have better performance than state-of-the-art estimators. On the other hand, our methods could handle more types of data and queries. In the first study, we focus on the dataset with large domain continuous attributes. Deep autoregressive models could achieve high accuracy on selectivity estimation. However, when datasets have continuous attributes with large domain sizes, the estimation accuracy and efficiency will be decreased dramatically. In this study, we propose a novel model IAM, which integrates Gaussian mixture models and a deep autoregressive model. On the one hand, Gaussian mixture models could reduce the domain sizes of continuous attributes to simplify the data distribution. On the other hand, the deep autoregressive model could learn the joint data distribution accurately. We also develop an unbiased progressive sampling algorithm for accurate and fast estimation on IAM. Our experimental results demonstrate that IAM can achieve higher accuracy than state-of-the-art estimators, and has the best query performance while being integrated into a query optimizer. In the second study, we focus on the dataset containing set-valued attributes and queries containing set containment search predicates. For the best of our knowledge, this is the first work which could handle queries involving point, range and set containment search predicates. In this study, we propose a selectivity estimator for dataset with setvalued attributes and queries with subset predicates and superset predicates. We first propose the set-valued column factorization problem, whereby each each set-valued column is converted to multiple numeric subcolumns, and set containment predicates are converted to numeric comparison predicates. This enables us to leverage any existing estimator to perform selectivity estimation. We then develop two methods for efficiently column factorization and query conversion, namely ST and ST-hist, and implement them into three estimators. In the third study, we focus on query workload drift problem on query-driven estimation. Query-driven estimators have poor performance if the distribution of target query workload is different with the distribution of source query workload. In this study, we first propose to take query template distribution as the query workload distribution and detect query workload drift with KL-divergence. We then propose a contrastive learning based estimators for query workload distribution adaptation if the query workload drift is detected, namely CLMSCN. CLMSCN conducts contrastive learning on source query workload and target query workload in the adaptation training stage, which improves the estimation accuracy on the target query workload.
author2	Gao Cong
author_facet	Gao Cong Meng, Zizhong
format	Thesis-Doctor of Philosophy
author	Meng, Zizhong
author_sort	Meng, Zizhong
title	Enhancing the performance of selectivity estimation with machine learning techniques
title_short	Enhancing the performance of selectivity estimation with machine learning techniques
title_full	Enhancing the performance of selectivity estimation with machine learning techniques
title_fullStr	Enhancing the performance of selectivity estimation with machine learning techniques
title_full_unstemmed	Enhancing the performance of selectivity estimation with machine learning techniques
title_sort	enhancing the performance of selectivity estimation with machine learning techniques
publisher	Nanyang Technological University
publishDate	2024
url	https://hdl.handle.net/10356/174777
_version_	1806059875132768256
spelling	sg-ntu-dr.10356-1747772024-05-03T02:58:53Z Enhancing the performance of selectivity estimation with machine learning techniques Meng, Zizhong Gao Cong School of Computer Science and Engineering gaocong@ntu.edu.sg Computer and Information Science Selectivity estimation is a fundamental task in database management systems and has been studied for decades. Given a SQL query, selectivity estimation returns the number of records in the query result without answering the query. Most real database management systems adopt statistical methods for selectivity estimation, such as histograms and sampling. These methods have good efficiency, but low accuracy since most of these methods are based on attribute independence assumptions. In recent years, machine learning shows good performance in many fields. Some studies introduce learning methods for selectivity estimation. These estimators outperform statistical methods in accuracy. However, current learning based estimators have some downsides and limitations, which make them not practical to be implemented into database management systems. In this dissertation, we point out limitations of state-of-the-art estimators and develop new methods for selectivity estimation. On the one hand, our methods have better performance than state-of-the-art estimators. On the other hand, our methods could handle more types of data and queries. In the first study, we focus on the dataset with large domain continuous attributes. Deep autoregressive models could achieve high accuracy on selectivity estimation. However, when datasets have continuous attributes with large domain sizes, the estimation accuracy and efficiency will be decreased dramatically. In this study, we propose a novel model IAM, which integrates Gaussian mixture models and a deep autoregressive model. On the one hand, Gaussian mixture models could reduce the domain sizes of continuous attributes to simplify the data distribution. On the other hand, the deep autoregressive model could learn the joint data distribution accurately. We also develop an unbiased progressive sampling algorithm for accurate and fast estimation on IAM. Our experimental results demonstrate that IAM can achieve higher accuracy than state-of-the-art estimators, and has the best query performance while being integrated into a query optimizer. In the second study, we focus on the dataset containing set-valued attributes and queries containing set containment search predicates. For the best of our knowledge, this is the first work which could handle queries involving point, range and set containment search predicates. In this study, we propose a selectivity estimator for dataset with setvalued attributes and queries with subset predicates and superset predicates. We first propose the set-valued column factorization problem, whereby each each set-valued column is converted to multiple numeric subcolumns, and set containment predicates are converted to numeric comparison predicates. This enables us to leverage any existing estimator to perform selectivity estimation. We then develop two methods for efficiently column factorization and query conversion, namely ST and ST-hist, and implement them into three estimators. In the third study, we focus on query workload drift problem on query-driven estimation. Query-driven estimators have poor performance if the distribution of target query workload is different with the distribution of source query workload. In this study, we first propose to take query template distribution as the query workload distribution and detect query workload drift with KL-divergence. We then propose a contrastive learning based estimators for query workload distribution adaptation if the query workload drift is detected, namely CLMSCN. CLMSCN conducts contrastive learning on source query workload and target query workload in the adaptation training stage, which improves the estimation accuracy on the target query workload. Doctor of Philosophy 2024-04-11T00:19:18Z 2024-04-11T00:19:18Z 2024 Thesis-Doctor of Philosophy Meng, Z. (2024). Enhancing the performance of selectivity estimation with machine learning techniques. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/174777 https://hdl.handle.net/10356/174777 10.32657/10356/174777 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University

Enhancing the performance of selectivity estimation with machine learning techniques

Similar Items