Discovering Data Structures: Nearest Neighbor Search and Beyond
- URL: http://arxiv.org/abs/2411.03253v1
- Date: Tue, 05 Nov 2024 16:50:54 GMT
- Title: Discovering Data Structures: Nearest Neighbor Search and Beyond
- Authors: Omar Salemohamed, Laurent Charlin, Shivam Garg, Vatsal Sharan, Gregory Valiant
- Abstract summary: We propose a general framework for end-to-end learning of data structures.
Our framework adapts to the underlying data distribution and provides fine-grained control over query and space complexity.
We first apply this framework to the problem of nearest neighbor search.
- Score: 18.774836778996544
- Abstract: We propose a general framework for end-to-end learning of data structures. Our framework adapts to the underlying data distribution and provides fine-grained control over query and space complexity. Crucially, the data structure is learned from scratch, and does not require careful initialization or seeding with candidate data structures/algorithms. We first apply this framework to the problem of nearest neighbor search. In several settings, we are able to reverse-engineer the learned data structures and query algorithms. For 1D nearest neighbor search, the model discovers optimal distribution (in)dependent algorithms such as binary search and variants of interpolation search. In higher dimensions, the model learns solutions that resemble k-d trees in some regimes, while in others, they have elements of locality-sensitive hashing. The model can also learn useful representations of high-dimensional data and exploit them to design effective data structures. We also adapt our framework to the problem of estimating frequencies over a data stream, and believe it could also be a powerful discovery tool for new problems.
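For reference, the two classical 1D algorithms named in the abstract can be sketched as follows. This is a minimal illustration of textbook binary search and interpolation search applied to nearest neighbor lookup in a sorted list, not the learned model or any code from the paper. The function names `nn_binary` and `nn_interpolation` and the probe counters are hypothetical, added only to make the query cost visible.

```python
def _closest(sorted_xs, q, indices):
    """Pick the value nearest to q among the in-range candidate indices."""
    candidates = [sorted_xs[i] for i in indices if 0 <= i < len(sorted_xs)]
    return min(candidates, key=lambda x: abs(x - q))


def nn_binary(sorted_xs, q):
    """Nearest neighbor of q via binary search; returns (value, probes)."""
    if not sorted_xs:
        raise ValueError("sorted_xs must be non-empty")
    lo, hi, probes = 0, len(sorted_xs), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_xs[mid] < q:
            lo = mid + 1
        else:
            hi = mid
    # lo is the first index whose value is >= q, so the answer is at lo-1 or lo.
    return _closest(sorted_xs, q, (lo - 1, lo)), probes


def nn_interpolation(sorted_xs, q):
    """Nearest neighbor of q via interpolation search; returns (value, probes).

    Probe positions are guessed by linear interpolation between the endpoints,
    which pays off when the values are roughly uniformly spread.
    """
    if not sorted_xs:
        raise ValueError("sorted_xs must be non-empty")
    lo, hi, probes = 0, len(sorted_xs) - 1, 0
    while lo <= hi and sorted_xs[lo] <= q <= sorted_xs[hi]:
        span = sorted_xs[hi] - sorted_xs[lo]
        if span == 0:
            break  # all remaining values equal q
        pos = lo + int((q - sorted_xs[lo]) * (hi - lo) / span)
        pos = min(pos, hi)  # guard against floating-point overshoot
        probes += 1
        if sorted_xs[pos] < q:
            lo = pos + 1
        elif sorted_xs[pos] > q:
            hi = pos - 1
        else:
            return sorted_xs[pos], probes
    # Everything left of lo is below q and everything right of hi is above q.
    return _closest(sorted_xs, q, (lo - 1, lo, hi, hi + 1)), probes


if __name__ == "__main__":
    data = [1, 3, 4, 8, 10, 15]
    print(nn_binary(data, 7)[0], nn_interpolation(data, 7)[0])
```

Binary search makes no assumption about how the values are distributed, whereas interpolation search uses the values themselves to guess where the query falls, which is the distribution-dependent behavior the abstract alludes to.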
Related papers
- Coordinated Multi-Neighborhood Learning on a Directed Acyclic Graph [6.727984016678534]
Learning the structure of causal directed acyclic graphs (DAGs) is useful in many areas of machine learning and artificial intelligence.
It is challenging to obtain good empirical and theoretical results without strong and often restrictive assumptions.
This paper develops a new constraint-based method for estimating the local structure around multiple user-specified target nodes.
arXiv Detail & Related papers (2024-05-24T08:49:43Z) - Hierarchical clustering with dot products recovers hidden tree structure [53.68551192799585]
In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure.
We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance.
We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model.
arXiv Detail & Related papers (2023-05-24T11:05:12Z) - Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z) - Multidimensional Assignment Problem for multipartite entity resolution [69.48568967931608]
Multipartite entity resolution aims at integrating records from multiple datasets into one entity.
We apply two procedures, a Greedy algorithm and a large scale neighborhood search, to solve the assignment problem.
We find evidence that design-based multi-start can be more efficient as the size of databases grows large.
arXiv Detail & Related papers (2021-12-06T20:34:55Z) - Mathematical Models for Local Sensing Hashes [7.400475825464313]
We show that approximated index structures offer a good opportunity to accelerate the neighbor search for clustering and outlier detection.
We indicate directions to mathematically model the properties of local sensing hashes.
arXiv Detail & Related papers (2021-11-16T10:40:55Z) - Learning to Hash Robustly, with Guarantees [79.68057056103014]
In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching those of theoretical algorithms.
We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically.
Our algorithm achieves 1.8x and 2.1x better recall on the worst-performing queries to the MNIST and ImageNet datasets, respectively.
arXiv Detail & Related papers (2021-08-11T20:21:30Z) - Evaluating Online and Offline Accuracy Traversal Algorithms for k-Complete Neural Network Architectures [6.123324869194195]
In this paper, we study compact neural network architectures for binary classification.
We investigate improvements in speed and accuracy when favoring overcomplete architecture candidates.
arXiv Detail & Related papers (2021-01-16T20:37:29Z) - Towards Optimally Efficient Tree Search with Deep Learning [76.64632985696237]
This paper investigates the classical integer least-squares problem, which estimates integer signals from linear models.
The problem is NP-hard and often arises in diverse applications such as signal processing, bioinformatics, communications and machine learning.
We propose a general hyper-accelerated tree search (HATS) algorithm by employing a deep neural network to estimate the optimal heuristic for the underlying simplified memory-bounded A* algorithm.
arXiv Detail & Related papers (2021-01-07T08:00:02Z) - Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations [21.68175843347951]
We present Deep Retrieval (DR), to learn a retrievable structure directly with user-item interaction data.
DR is among the first non-ANN algorithms successfully deployed at the scale of hundreds of millions of items for industrial recommendation systems.
arXiv Detail & Related papers (2020-07-12T06:23:51Z) - FedPD: A Federated Learning Framework with Optimal Rates and Adaptivity to Non-IID Data [59.50904660420082]
Federated Learning (FL) has become a popular paradigm for learning from distributed data.
To effectively utilize data at different devices without moving them to the cloud, algorithms such as Federated Averaging (FedAvg) have adopted a "computation then aggregation" (CTA) model.
arXiv Detail & Related papers (2020-05-22T23:07:42Z) - Data Structures & Algorithms for Exact Inference in Hierarchical Clustering [41.24805506595378]
We present novel dynamic-programming algorithms for exact inference in hierarchical clustering based on a novel trellis data structure.
Our algorithms scale in time and space proportional to the powerset of $N$ elements, which is super-exponentially more efficient than explicitly considering each of the $(2N-3)!!$ possible hierarchies.
arXiv Detail & Related papers (2020-02-26T17:43:53Z)