Learning to Hash Robustly, with Guarantees
- URL: http://arxiv.org/abs/2108.05433v1
- Date: Wed, 11 Aug 2021 20:21:30 GMT
- Title: Learning to Hash Robustly, with Guarantees
- Authors: Alexandr Andoni, Daniel Beaglehole
- Abstract summary: In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching that of theoretical algorithms.
We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically.
Our algorithm has a 1.8x and 2.1x better recall on the worst-performing queries to the MNIST and ImageNet datasets.
- Score: 79.68057056103014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The indexing algorithms for the high-dimensional nearest neighbor search
(NNS) with the best worst-case guarantees are based on the randomized Locality
Sensitive Hashing (LSH), and its derivatives. In practice, many heuristic
approaches exist to "learn" the best indexing method in order to speed-up NNS,
crucially adapting to the structure of the given dataset.
Oftentimes, these heuristics outperform the LSH-based algorithms on real
datasets, but, almost always, come at the cost of losing the guarantees of
either correctness or robust performance on adversarial queries, or apply to
datasets with an assumed extra structure/model. In this paper, we design an NNS
algorithm for the Hamming space that has worst-case guarantees essentially
matching that of theoretical algorithms, while optimizing the hashing to the
structure of the dataset (think instance-optimal algorithms) for performance on
the minimum-performing query. We evaluate the algorithm's ability to optimize
for a given dataset both theoretically and practically. On the theoretical
side, we exhibit a natural setting (dataset model) where our algorithm is much
better than the standard theoretical one. On the practical side, we run
experiments that show that our algorithm has a 1.8x and 2.1x better recall on
the worst-performing queries to the MNIST and ImageNet datasets.
Related papers
- Adversarially robust clustering with optimality guarantees [7.0830450321883935]
We consider the problem of clustering data points coming from sub-Gaussian mixtures.
Existing methods that provably achieve the optimal mislabeling error, such as the Lloyd algorithm, are usually vulnerable to outliers.
We propose a simple robust algorithm based on the coordinatewise median that obtains the optimal mislabeling rate even when we allow adversarial outliers to be present.
arXiv Detail & Related papers (2023-06-16T17:17:07Z) - Subset-Based Instance Optimality in Private Estimation [23.173651165908282]
We show how to construct private algorithms that achieve our notion of instance optimality when estimating a broad class of dataset properties.
Our algorithm simultaneously matches or exceeds the performance of existing algorithms under a range of distributional assumptions.
arXiv Detail & Related papers (2023-03-01T18:49:10Z) - Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics by minimizing the population loss that are more suitable in active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z) - Streaming Algorithms for High-Dimensional Robust Statistics [43.106438224356175]
We develop the first efficient streaming algorithms for high-dimensional robust statistics with near-optimal memory requirements.
Our main result is for the task of high-dimensional robust mean estimation in (a strengthening of) Huber's contamination model.
arXiv Detail & Related papers (2022-04-26T15:57:07Z) - Provably Faster Algorithms for Bilevel Optimization [54.83583213812667]
Bilevel optimization has been widely applied in many important machine learning applications.
We propose two new algorithms for bilevel optimization.
We show that both algorithms achieve the complexity of $mathcalO(epsilon-1.5)$, which outperforms all existing algorithms by the order of magnitude.
arXiv Detail & Related papers (2021-06-08T21:05:30Z) - Double Coverage with Machine-Learned Advice [100.23487145400833]
We study the fundamental online $k$-server problem in a learning-augmented setting.
We show that our algorithm achieves for any k an almost optimal consistency-robustness tradeoff.
arXiv Detail & Related papers (2021-03-02T11:04:33Z) - Benchmarking Simulation-Based Inference [5.3898004059026325]
Recent advances in probabilistic modelling have led to a large number of simulation-based inference algorithms which do not require numerical evaluation of likelihoods.
We provide a benchmark with inference tasks and suitable performance metrics, with an initial selection of algorithms.
We found that the choice of performance metric is critical, that even state-of-the-art algorithms have substantial room for improvement, and that sequential estimation improves sample efficiency.
arXiv Detail & Related papers (2021-01-12T18:31:22Z) - Towards Optimally Efficient Tree Search with Deep Learning [76.64632985696237]
This paper investigates the classical integer least-squares problem which estimates signals integer from linear models.
The problem is NP-hard and often arises in diverse applications such as signal processing, bioinformatics, communications and machine learning.
We propose a general hyper-accelerated tree search (HATS) algorithm by employing a deep neural network to estimate the optimal estimation for the underlying simplified memory-bounded A* algorithm.
arXiv Detail & Related papers (2021-01-07T08:00:02Z) - Efficient Sampling for Predictor-Based Neural Architecture Search [3.287802528135173]
We study predictor-based NAS algorithms for neural architecture search.
We show that the sample efficiency of predictor-based algorithms decreases dramatically if the proxy is only computed for a subset of the search space.
This is an important step to make predictor-based NAS algorithms useful, in practice.
arXiv Detail & Related papers (2020-11-24T11:36:36Z) - New Oracle-Efficient Algorithms for Private Synthetic Data Release [52.33506193761153]
We present three new algorithms for constructing differentially private synthetic data.
The algorithms satisfy differential privacy even in the worst case.
Compared to the state-of-the-art method High-Dimensional Matrix Mechanism citeMcKennaMHM18, our algorithms provide better accuracy in the large workload.
arXiv Detail & Related papers (2020-07-10T15:46:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.