Efficient Data-aware Distance Comparison Operations for High-Dimensional Approximate Nearest Neighbor Search
- URL: http://arxiv.org/abs/2411.17229v1
- Date: Tue, 26 Nov 2024 08:51:46 GMT
- Title: Efficient Data-aware Distance Comparison Operations for High-Dimensional Approximate Nearest Neighbor Search
- Authors: Liwei Deng, Penghao Chen, Ximu Zeng, Tianfu Wang, Yan Zhao, Kai Zheng
- Abstract summary: High-dimensional approximate $K$ nearest neighbor search (AKNN) is a fundamental task for various applications.
We propose a Data-Aware Distance Estimation approach, called DADE.
- Score: 14.77572360618428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-dimensional approximate $K$ nearest neighbor search (AKNN) is a fundamental task for various applications, including information retrieval. Most existing algorithms for AKNN can be decomposed into two main components, i.e., candidate generation and distance comparison operations (DCOs). While different methods have unique ways of generating candidates, they all share the same DCO process. In this study, we focus on accelerating the process of DCOs that dominates the time cost in most existing AKNN algorithms. To achieve this, we propose a \underline{D}ata-\underline{A}ware \underline{D}istance \underline{E}stimation approach, called \emph{DADE}, which approximates the \emph{exact} distance in a lower-dimensional space. We theoretically prove that the distance estimation in \emph{DADE} is \emph{unbiased} in terms of data distribution. Furthermore, we propose an optimized estimation based on the unbiased distance estimation formulation. In addition, we propose a hypothesis testing approach to adaptively determine the number of dimensions needed to estimate the \emph{exact} distance with sufficient confidence. We integrate \emph{DADE} into widely-used AKNN search algorithms, e.g., \emph{IVF} and \emph{HNSW}, and conduct extensive experiments to demonstrate its superiority.
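Since the abstract only sketches the method, the following is a minimal, hedged Python illustration of the idea rather than the authors' implementation: a PCA-style orthogonal rotation stands in for DADE's data-aware transform, and a fixed safety margin stands in for its hypothesis test. The names `DataAwareDCO`, `dims_schedule`, and `margin` are illustrative.

```python
import numpy as np

class DataAwareDCO:
    """Sketch of a data-aware distance comparison operation (DCO)."""

    def __init__(self, data, dims_schedule=(8, 16, 32, 64)):
        self.mean = data.mean(axis=0)
        # Orthogonal, data-aware rotation: axes ordered by explained variance.
        _, _, vt = np.linalg.svd(data - self.mean, full_matrices=False)
        self.basis = vt.T                              # (D, D) when n >= D
        self.proj = (data - self.mean) @ self.basis    # rotated database vectors
        self.dims = data.shape[1]
        self.dims_schedule = [d for d in dims_schedule if d < self.dims]

    def project_query(self, q):
        return (q - self.mean) @ self.basis

    def distance_or_prune(self, q_proj, idx, threshold, margin=1.1):
        """Squared distance to point `idx`, or None if it can be pruned
        against `threshold` (e.g. the current K-th best distance)."""
        prefix, acc = 0, 0.0
        for d in self.dims_schedule:
            diff = q_proj[prefix:d] - self.proj[idx, prefix:d]
            acc += float(diff @ diff)
            prefix = d
            # Prefix sums are lower bounds; scaling the prefix to the full
            # dimension is a rough stand-in for DADE's unbiased estimate.
            est = acc * self.dims / d
            # Simplified replacement for the paper's hypothesis test:
            # prune only when the estimate clears a safety margin.
            if est > margin * threshold:
                return None
        # Inconclusive: finish with the exact distance on the rest.
        diff = q_proj[prefix:] - self.proj[idx, prefix:]
        return acc + float(diff @ diff)
```

In an IVF or HNSW search loop, each candidate's exact distance computation would be replaced by `distance_or_prune` against the current $K$-th best distance, falling through to the exact value only when the estimate cannot rule the candidate out.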
Related papers
- Relative-Translation Invariant Wasserstein Distance [82.6068808353647]
We introduce a new family of distances, relative-translation invariant Wasserstein distances ($RW_p$)
We show that $RW_p$ distances are also real distance metrics defined on the quotient set $\mathcal{P}_p(\mathbb{R}^n)/\sim$ and are invariant to distribution translations.
arXiv Detail & Related papers (2024-09-04T03:41:44Z) - RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search [11.069814476661827]
Cross-modal ANNS aims to use the data vector from one modality to retrieve the most similar items from another.
State-of-the-art ANNS approaches perform poorly on out-of-distribution (OOD) workloads.
We propose the pRojected bipartite Graph (RoarGraph), an efficient ANNS graph index built under the guidance of the query distribution.
arXiv Detail & Related papers (2024-08-16T06:48:16Z) - Exploiting Pre-trained Models for Drug Target Affinity Prediction with Nearest Neighbors [58.661454334877256]
Drug-Target binding Affinity (DTA) prediction is essential for drug discovery.
Despite the application of deep learning methods to DTA prediction, the achieved accuracy remains suboptimal.
We propose $k$NN-DTA, a non-representation, embedding-based retrieval method applied on top of a pre-trained DTA prediction model.
arXiv Detail & Related papers (2024-07-21T15:49:05Z) - Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders [77.84801537608651]
Cross-encoder (CE) models which compute similarity by jointly encoding a query-item pair perform better than embedding-based models (dual-encoders) at estimating query-item relevance.
We propose a sparse-matrix factorization based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity.
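To make the recipe concrete, here is a hedged sketch of the general idea, not the paper's method verbatim: fit item embeddings offline from a matrix of observed cross-encoder (CE) scores, then at query time call the CE on a few anchor items, solve for a latent query embedding, and rank items by the approximate scores. A dense SVD stands in for the paper's sparse-matrix factorization, and `ce_score` and the anchor-selection policy are placeholders.

```python
import numpy as np

def fit_item_embeddings(score_matrix, rank=64):
    """Low-rank factorization of a (train-queries x items) CE score matrix."""
    u, s, vt = np.linalg.svd(score_matrix, full_matrices=False)
    return vt[:rank].T * s[:rank]                 # (n_items, rank)

def approx_knn(query, items, item_emb, ce_score, anchors, k=10):
    # Exact CE calls only for the anchor items.
    y = np.array([ce_score(query, items[a]) for a in anchors])
    # Least-squares fit of the latent query embedding from the anchor scores.
    q_emb, *_ = np.linalg.lstsq(item_emb[anchors], y, rcond=None)
    approx_scores = item_emb @ q_emb              # approximate CE score per item
    return np.argsort(-approx_scores)[:k]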
arXiv Detail & Related papers (2024-05-06T17:14:34Z) - A Near-Linear Time Algorithm for the Chamfer Distance [21.018781726524946]
Chamfer distance is a popular measure of dissimilarity between point clouds.
We present the first $(1+\epsilon)$-approximate algorithm for estimating the Chamfer distance with a near-linear running time.
Our experiments demonstrate that it is both accurate and fast on large high-dimensional datasets.
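For reference, a common convention (assumed here) defines the Chamfer distance from point set $A$ to point set $B$ as $\mathrm{CH}(A, B) = \sum_{a \in A} \min_{b \in B} d(a, b)$, with the symmetric variant $\mathrm{CH}(A, B) + \mathrm{CH}(B, A)$; this sum is what the algorithm above approximates to within a $(1+\epsilon)$ factor.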
arXiv Detail & Related papers (2023-07-06T15:07:48Z) - Linearized Wasserstein dimensionality reduction with approximation guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space.
We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size.
We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
arXiv Detail & Related papers (2023-02-14T22:12:16Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with the communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Training Robust Deep Models for Time-Series Domain: Novel Algorithms and Theoretical Analysis [32.45387153404849]
We propose a novel framework referred to as RObust Training for Time-Series (RO-TS) to create robust DNNs for time-series classification tasks.
We show the generality and advantages of our formulation using the summation structure over time-series alignments.
Our experiments on real-world benchmarks demonstrate that RO-TS creates more robust DNNs when compared to adversarial training.
arXiv Detail & Related papers (2022-07-09T17:21:03Z) - Scalable Differentially Private Clustering via Hierarchically Separated Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee.
Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical.
arXiv Detail & Related papers (2022-06-17T09:24:41Z) - ZARTS: On Zero-order Optimization for Neural Architecture Search [94.41017048659664]
Differentiable architecture search (DARTS) has been a popular one-shot paradigm for NAS due to its high efficiency.
This work turns to zero-order optimization and proposes a novel NAS scheme, called ZARTS, to search without enforcing the above approximation.
In particular, results on 12 benchmarks verify the outstanding robustness of ZARTS, where the performance of DARTS collapses due to its known instability issue.
arXiv Detail & Related papers (2021-10-10T09:35:15Z) - DEANN: Speeding up Kernel-Density Estimation using Approximate Nearest Neighbor Search [8.25574589820305]
We present an algorithm called Density Estimation from Approximate Nearest Neighbors (DEANN).
We apply ANN algorithms as a black-box subroutine to compute an unbiased Kernel Density Estimation (KDE).
We show that our implementation outperforms state-of-the-art implementations on all high-dimensional datasets we considered.
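A hedged sketch of the idea described above (not the DEANN reference implementation): the kernel density estimate at a query is split into an exact sum over the nearest neighbors returned by an ANN index, where the kernel is largest, plus a random-sampling estimate of the remaining points' contribution. `ann_index.query` and `kernel` are placeholders for any off-the-shelf ANN library and kernel function.

```python
import numpy as np

def kde_estimate(q, data, ann_index, kernel, m=50, num_samples=100, rng=None):
    rng = rng or np.random.default_rng()
    n = len(data)
    near_ids = ann_index.query(q, m)                  # ANN used as a black box
    near_sum = sum(kernel(q, data[i]) for i in near_ids)
    # Uniformly sample points outside the near set to estimate the tail.
    far_pool = np.setdiff1d(np.arange(n), near_ids)
    far_ids = rng.choice(far_pool, size=num_samples)
    far_sum = (n - len(near_ids)) * np.mean([kernel(q, data[i]) for i in far_ids])
    return (near_sum + far_sum) / n
```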
arXiv Detail & Related papers (2021-07-06T17:11:28Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - Nearest Neighbor Search Under Uncertainty [19.225091554227948]
Nearest Neighbor Search (NNS) is a central task in knowledge representation, learning, and reasoning.
This paper studies NNS under Uncertainty (NNSU)
arXiv Detail & Related papers (2021-03-08T20:20:01Z) - Manifold learning with approximate nearest neighbors [1.8477401359673706]
We use a broad range of approximate nearest neighbor algorithms within manifold learning algorithms and evaluate their impact on embedding accuracy.
Via a thorough empirical investigation based on the benchmark MNIST dataset, it is shown that approximate nearest neighbors lead to substantial improvements in computational time.
This application demonstrates how the proposed methods can be used to visualize and identify anomalies and uncover underlying structure within high-dimensional data.
arXiv Detail & Related papers (2021-02-22T12:04:23Z) - Diverse Knowledge Distillation for End-to-End Person Search [81.4926655119318]
Person search aims to localize and identify a specific person from a gallery of images.
Recent methods can be categorized into two groups, i.e., two-step and end-to-end approaches.
We propose a simple yet strong end-to-end network with diverse knowledge distillation to break the bottleneck.
arXiv Detail & Related papers (2020-12-21T09:04:27Z) - DecAug: Augmenting HOI Detection via Decomposition [54.65572599920679]
Current algorithms suffer from insufficient training samples and category imbalance within datasets.
We propose an efficient and effective data augmentation method called DecAug for HOI detection.
Experiments show that our method brings up to 3.3 mAP and 1.6 mAP improvements on the V-COCO and HICO-DET datasets, respectively.
arXiv Detail & Related papers (2020-10-02T13:59:05Z) - An $\tilde{O}(n^{5/4})$ Time $\varepsilon$-Approximation Algorithm for
RMS Matching in a Plane [3.9596068699962315]
The 2-Wasserstein distance (or RMS distance) is a useful measure of similarity between probability distributions.
We present a new $\varepsilon$-approximation algorithm that runs in $O(n^{5/4}\,\mathrm{poly}(\log n, 1/\varepsilon))$ time.
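For context, for equal-size point sets $A, B \subset \mathbb{R}^2$ with $|A| = |B| = n$, the RMS (2-Wasserstein) matching cost is, up to normalization, $\mathrm{RMS}(A, B) = \min_{\sigma} \big( \sum_{a \in A} \lVert a - \sigma(a) \rVert^2 \big)^{1/2}$, where $\sigma$ ranges over perfect matchings (bijections from $A$ to $B$); this is the quantity the $\varepsilon$-approximation algorithm estimates.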
arXiv Detail & Related papers (2020-07-15T14:47:25Z) - Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations [21.68175843347951]
We present Deep Retrieval (DR) to learn a retrievable structure directly from user-item interaction data.
DR is among the first non-ANN algorithms successfully deployed at the scale of hundreds of millions of items for industrial recommendation systems.
arXiv Detail & Related papers (2020-07-12T06:23:51Z) - Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces [62.35667646858558]
We introduce the Anchor Energy (AE) and Anchor Wasserstein (AW) distances, which are respectively the energy and Wasserstein distances instantiated on such representations.
Our main contribution is to propose a sweep line algorithm to compute AE exactly in log-quadratic time, where a naive implementation would be cubic.
We show that AE and AW perform well in various experimental settings at a fraction of the computational cost of popular GW approximations.
arXiv Detail & Related papers (2020-02-05T03:09:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.