Efficient Data-aware Distance Comparison Operations for High-Dimensional Approximate Nearest Neighbor Search
- URL: http://arxiv.org/abs/2411.17229v1
- Date: Tue, 26 Nov 2024 08:51:46 GMT
- Title: Efficient Data-aware Distance Comparison Operations for High-Dimensional Approximate Nearest Neighbor Search
- Authors: Liwei Deng, Penghao Chen, Ximu Zeng, Tianfu Wang, Yan Zhao, Kai Zheng,
- Abstract summary: High-dimensional approximate $K$ nearest neighbor search (AKNN) is a fundamental task for various applications.
We propose an underlineData-underlineAware underlineDistance underlineEstimation approach, called emphDADE.
- Score: 14.77572360618428
- License:
- Abstract: High-dimensional approximate $K$ nearest neighbor search (AKNN) is a fundamental task for various applications, including information retrieval. Most existing algorithms for AKNN can be decomposed into two main components, i.e., candidate generation and distance comparison operations (DCOs). While different methods have unique ways of generating candidates, they all share the same DCO process. In this study, we focus on accelerating the process of DCOs that dominates the time cost in most existing AKNN algorithms. To achieve this, we propose an \underline{D}ata-\underline{A}ware \underline{D}istance \underline{E}stimation approach, called \emph{DADE}, which approximates the \emph{exact} distance in a lower-dimensional space. We theoretically prove that the distance estimation in \emph{DADE} is \emph{unbiased} in terms of data distribution. Furthermore, we propose an optimized estimation based on the unbiased distance estimation formulation. In addition, we propose a hypothesis testing approach to adaptively determine the number of dimensions needed to estimate the \emph{exact} distance with sufficient confidence. We integrate \emph{DADE} into widely-used AKNN search algorithms, e.g., \emph{IVF} and \emph{HNSW}, and conduct extensive experiments to demonstrate the superiority.
Related papers
- Relative-Translation Invariant Wasserstein Distance [82.6068808353647]
We introduce a new family of distances, relative-translation invariant Wasserstein distances ($RW_p$)
We show that $RW_p distances are also real distance metrics defined on the quotient set $mathcalP_p(mathbbRn)/sim$ invariant to distribution translations.
arXiv Detail & Related papers (2024-09-04T03:41:44Z) - Exploiting Pre-trained Models for Drug Target Affinity Prediction with Nearest Neighbors [58.661454334877256]
Drug-Target binding Affinity (DTA) prediction is essential for drug discovery.
Despite the application of deep learning methods to DTA prediction, the achieved accuracy remain suboptimal.
We propose $k$NN-DTA, a non-representation embedding-based retrieval method adopted on a pre-trained DTA prediction model.
arXiv Detail & Related papers (2024-07-21T15:49:05Z) - A Near-Linear Time Algorithm for the Chamfer Distance [21.018781726524946]
Chamfer distance is a popular measure of dissimilarity between point clouds.
We present the first $(1+epsilon)$-approximate algorithm for estimating the Chamfer distance with a near-linear running time.
Our experiments demonstrate that it is both accurate and fast on large high-dimensional datasets.
arXiv Detail & Related papers (2023-07-06T15:07:48Z) - Linearized Wasserstein dimensionality reduction with approximation
guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space.
We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size.
We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
arXiv Detail & Related papers (2023-02-14T22:12:16Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (emphi.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(frac1sqrtnT + frac1(k/d)2 T)$ with the communication cost of $O(k log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Training Robust Deep Models for Time-Series Domain: Novel Algorithms and
Theoretical Analysis [32.45387153404849]
We propose a novel framework referred as RObust Training for Time-Series (RO-TS) to create robust DNNs for time-series classification tasks.
We show the generality and advantages of our formulation using the summation structure over time-series alignments.
Our experiments on real-world benchmarks demonstrate that RO-TS creates more robust DNNs when compared to adversarial training.
arXiv Detail & Related papers (2022-07-09T17:21:03Z) - Scalable Differentially Private Clustering via Hierarchically Separated
Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d3/2log n)cdot OPT + O(k d2 log2 n / epsilon2)$, where $epsilon$ is the privacy guarantee.
Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical.
arXiv Detail & Related papers (2022-06-17T09:24:41Z) - DEANN: Speeding up Kernel-Density Estimation using Approximate Nearest
Neighbor Search [8.25574589820305]
We present an algorithm called Density Estimation from Approximate Nearest Neighbors (DEANN)
We apply ANN algorithms as a black box subroutine to compute an unbiased Density Estimation (KDE)
We show that our implementation outperforms state of the art implementations in all high dimensional datasets we considered.
arXiv Detail & Related papers (2021-07-06T17:11:28Z) - Nearest Neighbor Search Under Uncertainty [19.225091554227948]
Nearest Neighbor Search (NNS) is a central task in knowledge representation, learning, and reasoning.
This paper studies NNS under Uncertainty (NNSU)
arXiv Detail & Related papers (2021-03-08T20:20:01Z) - An $\tilde{O}(n^{5/4})$ Time $\varepsilon$-Approximation Algorithm for
RMS Matching in a Plane [3.9596068699962315]
The 2-Wasserstein distance (or RMS distance) is a useful measure of similarity between probability distributions.
We present a new $varepsilon$-approximation algorithm that runs in $O(n5/4mathrmpolylog n,1/varepsilon)$ time.
arXiv Detail & Related papers (2020-07-15T14:47:25Z) - Fast and Robust Comparison of Probability Measures in Heterogeneous
Spaces [62.35667646858558]
We introduce the Anchor Energy (AE) and Anchor Wasserstein (AW) distances, which are respectively the energy and Wasserstein distances instantiated on such representations.
Our main contribution is to propose a sweep line algorithm to compute AE emphexactly in log-quadratic time, where a naive implementation would be cubic.
We show that AE and AW perform well in various experimental settings at a fraction of the computational cost of popular GW approximations.
arXiv Detail & Related papers (2020-02-05T03:09:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.