A Bi-metric Framework for Fast Similarity Search
- URL: http://arxiv.org/abs/2406.02891v1
- Date: Wed, 5 Jun 2024 03:17:48 GMT
- Title: A Bi-metric Framework for Fast Similarity Search
- Authors: Haike Xu, Sandeep Silwal, Piotr Indyk
- Abstract summary: We propose a new "bi-metric" framework for designing nearest neighbor data structures.
Our framework assumes two dissimilarity functions: a ground-truth metric that is accurate but expensive to compute, and a proxy metric that is cheaper but less accurate.
We show how to construct data structures using only the proxy metric, so that the query procedure matches the accuracy of the expensive metric while making a limited number of calls to both metrics.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new "bi-metric" framework for designing nearest neighbor data structures. Our framework assumes two dissimilarity functions: a ground-truth metric that is accurate but expensive to compute, and a proxy metric that is cheaper but less accurate. In both theory and practice, we show how to construct data structures using only the proxy metric such that the query procedure achieves the accuracy of the expensive metric, while only using a limited number of calls to both metrics. Our theoretical results instantiate this framework for two popular nearest neighbor search algorithms: DiskANN and Cover Tree. In both cases we show that, as long as the proxy metric used to construct the data structure approximates the ground-truth metric up to a bounded factor, our data structure achieves arbitrarily good approximation guarantees with respect to the ground-truth metric. On the empirical side, we apply the framework to the text retrieval problem with two dissimilarity functions evaluated by ML models with vastly different computational costs. We observe that for almost all data sets in the MTEB benchmark, our approach achieves a considerably better accuracy-efficiency tradeoff than the alternatives, such as re-ranking.
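The two-metric query pattern described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the paper builds proxy-based DiskANN/Cover Tree indices with provable guarantees, whereas the sketch below uses a brute-force proxy shortlist followed by expensive re-ranking (the baseline the paper compares against). All function names and the toy metrics are hypothetical.

```python
# Minimal sketch of a bi-metric query (hypothetical names, toy metrics):
# shortlist candidates with a cheap proxy, then spend a limited budget of
# expensive ground-truth evaluations only on the shortlist.
import heapq

def proxy_dist(a, b):
    # cheap but less accurate dissimilarity (toy stand-in)
    return abs(a - b)

def ground_truth_dist(a, b):
    # accurate but expensive metric; stand-in for e.g. a large ML model
    return abs(a - b) + 0.1 * ((a * b) % 3)

def bi_metric_query(data, q, k, candidate_budget):
    # Stage 1: use only the proxy metric to shortlist candidates.
    candidates = heapq.nsmallest(candidate_budget, data,
                                 key=lambda x: proxy_dist(x, q))
    # Stage 2: spend the expensive metric only on the shortlist.
    return heapq.nsmallest(k, candidates,
                           key=lambda x: ground_truth_dist(x, q))

points = [1, 4, 7, 10, 13, 16]
print(bi_metric_query(points, 8, k=2, candidate_budget=4))  # → [7, 10]
```

The `candidate_budget` parameter is what caps the number of expensive-metric calls; the paper's contribution is showing that an index built entirely from the proxy can achieve arbitrarily good ground-truth accuracy under bounded-distortion assumptions, which plain re-ranking does not guarantee.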
Related papers
- Data-driven abstractions via adaptive refinements and a Kantorovich metric [extended version] [56.94699829208978]
We introduce an adaptive refinement procedure for smart, and scalable abstraction of dynamical systems.
In order to learn the optimal structure, we define a Kantorovich-inspired metric between Markov chains.
We show that our method yields a much better computational complexity than using classical linear programming techniques.
arXiv Detail & Related papers (2023-03-30T11:26:40Z)
- Scalable Optimal Multiway-Split Decision Trees with Constraints [3.092691764363848]
We present a novel path-based MIP formulation where the number of decision variables is independent of $N$.
Our framework produces a multiway-split tree which is more interpretable than the typical binary-split trees due to its shorter rules.
We present results on datasets containing up to 1,008,372 samples while existing MIP-based decision tree models do not scale well on data beyond a few thousand points.
arXiv Detail & Related papers (2023-02-14T03:48:48Z)
- Deep Active Ensemble Sampling For Image Classification [8.31483061185317]
Active learning frameworks aim to reduce the cost of data annotation by actively requesting the labeling for the most informative data points.
Proposed approaches include uncertainty-based techniques, geometric methods, and implicit combinations of the two.
We present an innovative integration of recent progress in both uncertainty-based and geometric frameworks to enable an efficient exploration/exploitation trade-off in sample selection strategy.
Our framework provides two advantages: (1) accurate posterior estimation, and (2) tune-able trade-off between computational overhead and higher accuracy.
arXiv Detail & Related papers (2022-10-11T20:20:20Z)
- Relational Proxies: Emergent Relationships as Fine-Grained Discriminators [52.17542855760418]
We propose a novel approach that leverages information between the global and local parts of an object for encoding its label.
We design proxies based on our theoretical findings and evaluate them on seven challenging fine-grained benchmark datasets.
We also experimentally validate our theory and obtain consistent results across multiple benchmarks.
arXiv Detail & Related papers (2022-10-05T11:08:04Z)
- Shapley-NAS: Discovering Operation Contribution for Neural Architecture Search [96.20505710087392]
We propose a Shapley value based method to evaluate operation contribution (Shapley-NAS) for neural architecture search.
We show that our method outperforms the state-of-the-art methods by a considerable margin with light search cost.
arXiv Detail & Related papers (2022-06-20T14:41:49Z)
- Learning to Hash Robustly, with Guarantees [79.68057056103014]
In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching that of theoretical algorithms.
We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically.
Our algorithm achieves 1.8x and 2.1x better recall on the worst-performing queries on the MNIST and ImageNet datasets, respectively.
arXiv Detail & Related papers (2021-08-11T20:21:30Z)
- Fewer is More: A Deep Graph Metric Learning Perspective Using Fewer Proxies [65.92826041406802]
We propose a Proxy-based deep Graph Metric Learning approach from the perspective of graph classification.
Multiple global proxies are leveraged to collectively approximate the original data points for each class.
We design a novel reverse label propagation algorithm, by which the neighbor relationships are adjusted according to ground-truth labels.
arXiv Detail & Related papers (2020-10-26T14:52:42Z)
- Probabilistic Case-based Reasoning for Open-World Knowledge Graph Completion [59.549664231655726]
A case-based reasoning (CBR) system solves a new problem by retrieving cases that are similar to the given problem.
In this paper, we demonstrate that such a system is achievable for reasoning in knowledge-bases (KBs).
Our approach predicts attributes for an entity by gathering reasoning paths from similar entities in the KB.
arXiv Detail & Related papers (2020-10-07T17:48:12Z)
- Robust Similarity and Distance Learning via Decision Forests [8.587164648430251]
We propose a novel decision forest algorithm for the task of distance learning, which we call Similarity and Metric Random Forests (SMERF).
Its ability to approximate arbitrary distances and identify important features is empirically demonstrated on simulated data sets.
arXiv Detail & Related papers (2020-07-27T20:17:42Z)
- A Practical Index Structure Supporting Fréchet Proximity Queries Among Trajectories [1.9335262420787858]
We present a scalable approach for range and $k$ nearest neighbor queries under computationally expensive metrics.
Based on clustering for metric indexes, we obtain a dynamic tree structure whose size is linear in the number of trajectories.
We analyze the efficiency and effectiveness of our methods with extensive experiments on diverse synthetic and real-world data sets.
arXiv Detail & Related papers (2020-05-28T04:12:43Z)
- Metric Learning for Ordered Labeled Trees with pq-grams [11.284638114256712]
We propose a new metric learning approach for tree-structured data with pq-grams.
The pq-gram distance is a distance for ordered labeled trees, and has much lower computation cost than the tree edit distance.
We empirically show that the proposed approach achieves competitive results with the state-of-the-art edit distance-based methods.
arXiv Detail & Related papers (2020-03-09T08:04:47Z)
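The pq-gram distance mentioned in the last entry is simple enough to sketch. The following is a minimal illustration of the standard construction (a length-p ancestor "stem" plus a width-q sliding window over children, padded with `*`), with a toy `Node` class; it is not code from any of the papers above.

```python
# Minimal sketch of the pq-gram distance for ordered labeled trees:
# p = length of the ancestor "stem", q = width of the child window.
from collections import Counter

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def pq_profile(root, p=2, q=2):
    """Multiset of pq-grams: p ancestor labels + q consecutive child labels."""
    profile = []
    def visit(node, stem):
        stem = (stem + [node.label])[-p:]  # shift this node's label into the stem
        # pad the child sequence with q-1 dummy labels on each side
        padded = ['*'] * (q - 1) + [c.label for c in node.children] + ['*'] * (q - 1)
        for i in range(max(len(node.children) + q - 1, 1)):
            profile.append(tuple(stem + padded[i:i + q]))
        for child in node.children:
            visit(child, stem)
    visit(root, ['*'] * p)
    return Counter(profile)

def pq_gram_distance(t1, t2, p=2, q=2):
    """Normalized symmetric-difference distance between pq-gram profiles."""
    a, b = pq_profile(t1, p, q), pq_profile(t2, p, q)
    overlap = sum((a & b).values())  # multiset intersection size
    return 1 - 2 * overlap / (sum(a.values()) + sum(b.values()))
```

Because the profile is built in a single traversal, the distance costs linear time in the tree sizes, which is the source of the speedup over the (cubic-time) tree edit distance.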
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.