A Bi-metric Framework for Fast Similarity Search
- URL: http://arxiv.org/abs/2406.02891v1
- Date: Wed, 5 Jun 2024 03:17:48 GMT
- Title: A Bi-metric Framework for Fast Similarity Search
- Authors: Haike Xu, Sandeep Silwal, Piotr Indyk
- Abstract summary: We propose a new "bi-metric" framework for designing nearest neighbor data structures.
Our framework assumes two dissimilarity functions: a ground-truth metric that is accurate but expensive to compute, and a proxy metric that is cheaper but less accurate.
We show how to construct data structures using only the proxy metric, so that the query procedure matches the accuracy of the expensive metric while making a limited number of calls to both metrics.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new "bi-metric" framework for designing nearest neighbor data structures. Our framework assumes two dissimilarity functions: a ground-truth metric that is accurate but expensive to compute, and a proxy metric that is cheaper but less accurate. In both theory and practice, we show how to construct data structures using only the proxy metric such that the query procedure achieves the accuracy of the expensive metric, while only using a limited number of calls to both metrics. Our theoretical results instantiate this framework for two popular nearest neighbor search algorithms: DiskANN and Cover Tree. In both cases we show that, as long as the proxy metric used to construct the data structure approximates the ground-truth metric up to a bounded factor, our data structure achieves arbitrarily good approximation guarantees with respect to the ground-truth metric. On the empirical side, we apply the framework to the text retrieval problem with two dissimilarity functions evaluated by ML models with vastly different computational costs. We observe that for almost all data sets in the MTEB benchmark, our approach achieves a considerably better accuracy-efficiency tradeoff than the alternatives, such as re-ranking.
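The two-metric query pattern described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the paper builds proxy-based DiskANN/Cover Tree indices with provable guarantees, whereas the sketch below uses a brute-force proxy shortlist followed by expensive re-ranking (the baseline the paper compares against). All function names and the toy metrics are hypothetical.

```python
# Minimal sketch of a bi-metric query (hypothetical names, toy metrics):
# shortlist candidates with a cheap proxy, then spend a limited budget of
# expensive ground-truth evaluations only on the shortlist.
import heapq

def proxy_dist(a, b):
    # cheap but less accurate dissimilarity (toy stand-in)
    return abs(a - b)

def ground_truth_dist(a, b):
    # accurate but expensive metric; stand-in for e.g. a large ML model
    return abs(a - b) + 0.1 * ((a * b) % 3)

def bi_metric_query(data, q, k, candidate_budget):
    # Stage 1: use only the proxy metric to shortlist candidates.
    candidates = heapq.nsmallest(candidate_budget, data,
                                 key=lambda x: proxy_dist(x, q))
    # Stage 2: spend the expensive metric only on the shortlist.
    return heapq.nsmallest(k, candidates,
                           key=lambda x: ground_truth_dist(x, q))

points = [1, 4, 7, 10, 13, 16]
print(bi_metric_query(points, 8, k=2, candidate_budget=4))  # → [7, 10]
```

The `candidate_budget` parameter is what caps the number of expensive-metric calls; the paper's contribution is showing that an index built entirely from the proxy can achieve arbitrarily good ground-truth accuracy under bounded-distortion assumptions, which plain re-ranking does not guarantee.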
Related papers
- Data-driven abstractions via adaptive refinements and a Kantorovich metric [extended version] [56.94699829208978]
We introduce an adaptive refinement procedure for smart, and scalable abstraction of dynamical systems.
In order to learn the optimal structure, we define a Kantorovich-inspired metric between Markov chains.
We show that our method yields a much better computational complexity than using classical linear programming techniques.
arXiv Detail & Related papers (2023-03-30T11:26:40Z)
- Scalable Optimal Multiway-Split Decision Trees with Constraints [3.092691764363848]
We present a novel path-based MIP formulation where the number of decision variables is independent of $N$.
Our framework produces a multiway-split tree which is more interpretable than the typical binary-split trees due to its shorter rules.
We present results on datasets containing up to 1,008,372 samples while existing MIP-based decision tree models do not scale well on data beyond a few thousand points.
arXiv Detail & Related papers (2023-02-14T03:48:48Z)
- Deep Active Ensemble Sampling For Image Classification [8.31483061185317]
Active learning frameworks aim to reduce the cost of data annotation by actively requesting the labeling for the most informative data points.
Proposed approaches include uncertainty-based techniques, geometric methods, and implicit combinations of the two.
We present an innovative integration of recent progress in both uncertainty-based and geometric frameworks to enable an efficient exploration/exploitation trade-off in sample selection strategy.
Our framework provides two advantages: (1) accurate posterior estimation, and (2) tune-able trade-off between computational overhead and higher accuracy.
arXiv Detail & Related papers (2022-10-11T20:20:20Z)
- Relational Proxies: Emergent Relationships as Fine-Grained Discriminators [52.17542855760418]
We propose a novel approach that leverages information between the global and local parts of an object for encoding its label.
We design proxies based on our theoretical findings and evaluate them on seven challenging fine-grained benchmark datasets.
We also experimentally validate our theory and obtain consistent results across multiple benchmarks.
arXiv Detail & Related papers (2022-10-05T11:08:04Z)
- Shapley-NAS: Discovering Operation Contribution for Neural Architecture Search [96.20505710087392]
We propose a Shapley value based method to evaluate operation contribution (Shapley-NAS) for neural architecture search.
We show that our method outperforms the state-of-the-art methods by a considerable margin with light search cost.
arXiv Detail & Related papers (2022-06-20T14:41:49Z)
- Learning to Hash Robustly, with Guarantees [79.68057056103014]
In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching that of theoretical algorithms.
We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically.
Our algorithm achieves 1.8x and 2.1x better recall on the worst-performing queries on the MNIST and ImageNet datasets, respectively.
arXiv Detail & Related papers (2021-08-11T20:21:30Z)
- Fewer is More: A Deep Graph Metric Learning Perspective Using Fewer Proxies [65.92826041406802]
We propose a Proxy-based deep Graph Metric Learning approach from the perspective of graph classification.
Multiple global proxies are leveraged to collectively approximate the original data points for each class.
We design a novel reverse label propagation algorithm, by which the neighbor relationships are adjusted according to ground-truth labels.
arXiv Detail & Related papers (2020-10-26T14:52:42Z)
- Probabilistic Case-based Reasoning for Open-World Knowledge Graph Completion [59.549664231655726]
A case-based reasoning (CBR) system solves a new problem by retrieving cases that are similar to the given problem.
In this paper, we demonstrate that such a system is achievable for reasoning in knowledge-bases (KBs).
Our approach predicts attributes for an entity by gathering reasoning paths from similar entities in the KB.
arXiv Detail & Related papers (2020-10-07T17:48:12Z)
- Robust Similarity and Distance Learning via Decision Forests [8.587164648430251]
We propose a novel decision forest algorithm for the task of distance learning, which we call Similarity and Metric Random Forests (SMERF).
Its ability to approximate arbitrary distances and identify important features is empirically demonstrated on simulated data sets.
arXiv Detail & Related papers (2020-07-27T20:17:42Z)
- A Practical Index Structure Supporting Fréchet Proximity Queries Among Trajectories [1.9335262420787858]
We present a scalable approach for range and $k$ nearest neighbor queries under computationally expensive metrics.
Based on clustering for metric indexes, we obtain a dynamic tree structure whose size is linear in the number of trajectories.
We analyze the efficiency and effectiveness of our methods with extensive experiments on diverse synthetic and real-world data sets.
arXiv Detail & Related papers (2020-05-28T04:12:43Z)
- Metric Learning for Ordered Labeled Trees with pq-grams [11.284638114256712]
We propose a new metric learning approach for tree-structured data with pq-grams.
The pq-gram distance is a distance for ordered labeled trees, and has much lower computation cost than the tree edit distance.
We empirically show that the proposed approach achieves competitive results with the state-of-the-art edit distance-based methods.
arXiv Detail & Related papers (2020-03-09T08:04:47Z)
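The pq-gram distance mentioned in the last entry is simple enough to sketch. The following is a minimal illustration of the standard construction (a length-p ancestor "stem" plus a width-q sliding window over children, padded with `*`), with a toy `Node` class; it is not code from any of the papers above.

```python
# Minimal sketch of the pq-gram distance for ordered labeled trees:
# p = length of the ancestor "stem", q = width of the child window.
from collections import Counter

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def pq_profile(root, p=2, q=2):
    """Multiset of pq-grams: p ancestor labels + q consecutive child labels."""
    profile = []
    def visit(node, stem):
        stem = (stem + [node.label])[-p:]  # shift this node's label into the stem
        # pad the child sequence with q-1 dummy labels on each side
        padded = ['*'] * (q - 1) + [c.label for c in node.children] + ['*'] * (q - 1)
        for i in range(max(len(node.children) + q - 1, 1)):
            profile.append(tuple(stem + padded[i:i + q]))
        for child in node.children:
            visit(child, stem)
    visit(root, ['*'] * p)
    return Counter(profile)

def pq_gram_distance(t1, t2, p=2, q=2):
    """Normalized symmetric-difference distance between pq-gram profiles."""
    a, b = pq_profile(t1, p, q), pq_profile(t2, p, q)
    overlap = sum((a & b).values())  # multiset intersection size
    return 1 - 2 * overlap / (sum(a.values()) + sum(b.values()))
```

Because the profile is built in a single traversal, the distance costs linear time in the tree sizes, which is the source of the speedup over the (cubic-time) tree edit distance.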
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.