Reliable, efficient and light distance computation on high-dimensional vectors


Bibliographic Details
Main Author: Gao, Jianyang
Other Authors: Long, Cheng
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2025
Subjects:
Online Access:https://hdl.handle.net/10356/182813
Institution: Nanyang Technological University
Description
Summary: Recent advances in deep learning make it possible to distill the semantic information of ubiquitous unstructured data into high-dimensional vectors, making the management of vector data an urgent priority. The nearest neighbor (NN) query is a pivotal problem in vector data management. Unfortunately, the NN query has been proven to be intrinsically hard. Researchers have thus turned to the approximate nearest neighbor (ANN) query, which targets better efficiency by slightly compromising accuracy. Despite this, existing ANN methods still incur high time and space costs, mainly due to the high dimensionality of the vectors. Specifically, in most in-memory ANN algorithms, the time cost is dominated by comparing distances among vectors: each distance comparison operation (DCO) involves hundreds of arithmetic operations because each vector has hundreds of dimensions. Similarly, the space cost is dominated by storing the vectors themselves. Toward reducing these time and space costs, this thesis presents three methods. First, we observe that, for a DCO, if an approximate distance suffices to confirm that a vector is not an NN, then the vector can be discarded without computing its exact distance. Based on this observation, we propose a method named ADSampling. For each DCO, it iteratively and incrementally samples a few dimensions and computes an approximate distance; once this distance is accurate enough to decide the result of the DCO, the sampling terminates. Second, to further improve the efficiency of DCOs, we design an efficient distance estimator. We propose a new quantization method named RaBitQ, which maps $D$-dimensional vectors to $D$-bit strings. RaBitQ provides an unbiased and asymptotically optimal estimator with good empirical accuracy. In addition, we introduce efficient implementations of RaBitQ based on bitwise and SIMD operations. Third, by extending RaBitQ, we develop an algorithm that compresses a floating-point vector into a $B$-bit unsigned integer vector, and we prove that this algorithm is asymptotically optimal in terms of the space-accuracy trade-off.
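The early-termination idea behind the first method can be illustrated with a minimal sketch: accumulate the squared distance over batches of dimensions and stop as soon as the extrapolated distance already exceeds a pruning threshold. The function name `approx_dco`, the batch size, and the fixed `slack` margin are illustrative choices of ours; ADSampling itself samples randomly projected dimensions and terminates based on a principled error bound rather than a fixed margin.

```python
import numpy as np

def approx_dco(query, vec, threshold, batch=32, slack=1.1):
    """Sketch of an incremental distance-comparison operation (DCO).

    Returns (survives, distance): survives is False if the vector can be
    pruned using only a partial, extrapolated distance; otherwise the
    exact distance is computed. A hypothetical simplification of the
    ADSampling idea, not the thesis's actual algorithm.
    """
    d = len(query)
    partial = 0.0
    for start in range(0, d, batch):
        diff = query[start:start + batch] - vec[start:start + batch]
        partial += float(diff @ diff)
        sampled = min(start + batch, d)
        # Extrapolate the partial squared distance to all d dimensions.
        estimate = partial * d / sampled
        if estimate > slack * threshold ** 2:
            return False, np.sqrt(estimate)  # pruned: cannot be an NN
    return True, np.sqrt(partial)  # all dimensions used: exact distance
```

In the pruned case, only a fraction of the dimensions contribute arithmetic operations, which is the source of the claimed speedup for DCOs.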
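To make the second method concrete, here is a much-simplified sign-quantization sketch in the same spirit: rotate a vector by a random orthogonal matrix and keep one sign bit per coordinate, then estimate angular similarity from the fraction of agreeing bits. This is the classic SimHash-style estimator, used here only as an accessible stand-in; the actual RaBitQ estimator is different and, per the thesis, unbiased and asymptotically optimal. All function names below are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rotation(d):
    # Random orthogonal matrix via QR of a Gaussian matrix, used to
    # rotate vectors before sign quantization.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def quantize(vec, rotation):
    # Map a D-dimensional float vector to D bits: one sign bit per
    # rotated coordinate.
    return (rotation @ vec) >= 0

def estimate_cosine(bits_a, bits_b):
    # Fraction of agreeing sign bits -> angle estimate via
    # P(agree) = 1 - theta/pi, hence theta = pi * (1 - agree).
    agree = np.mean(bits_a == bits_b)
    return np.cos(np.pi * (1.0 - agree))
```

Because the codes are bit strings, the agreement count reduces to XOR plus popcount, which is what makes bitwise and SIMD implementations natural for this family of estimators.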
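For the third method, a plain uniform scalar quantizer shows what "compressing a floating-point vector into a $B$-bit unsigned integer vector" means mechanically, though the thesis's extended RaBitQ codes are constructed differently and carry the optimality guarantee. The per-vector scale/offset scheme below is a generic baseline of ours, not the thesis's construction.

```python
import numpy as np

def compress_to_bbit(vec, b=8):
    # Uniform scalar quantization into B-bit unsigned integers plus a
    # per-vector offset and scale (a generic baseline, not the thesis's
    # extended-RaBitQ code).
    lo, hi = vec.min(), vec.max()
    levels = (1 << b) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((vec - lo) / scale).astype(np.uint32)
    return codes, lo, scale

def decompress(codes, lo, scale):
    # Reconstruct an approximation of the original float vector.
    return codes.astype(np.float64) * scale + lo
```

Rounding to the nearest of $2^B$ levels bounds the per-coordinate reconstruction error by half a quantization step, which is the kind of space-accuracy trade-off the thesis analyzes.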