Frequency Estimation in Data Streams: Learning the Optimal Hashing
Scheme
- URL: http://arxiv.org/abs/2007.09261v2
- Date: Wed, 2 Jun 2021 04:38:39 GMT
- Title: Frequency Estimation in Data Streams: Learning the Optimal Hashing
Scheme
- Authors: Dimitris Bertsimas and Vassilis Digalakis Jr
- Abstract summary: We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning.
The proposed approach exploits an observed stream prefix to near-optimally hash elements and compress the target frequency distribution.
We show that the proposed approach outperforms existing approaches by one to two orders of magnitude in terms of its average (per element) estimation error and by 45-90% in terms of its expected magnitude of estimation error.
- Score: 3.7565501074323224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel approach for the problem of frequency estimation in data
streams that is based on optimization and machine learning. Contrary to
state-of-the-art streaming frequency estimation algorithms, which heavily rely
on random hashing to maintain the frequency distribution of the data stream
using limited storage, the proposed approach exploits an observed stream prefix
to near-optimally hash elements and compress the target frequency distribution.
We develop an exact mixed-integer linear optimization formulation, which
enables us to compute optimal or near-optimal hashing schemes for elements seen
in the observed stream prefix; then, we use machine learning to hash unseen
elements. Further, we develop an efficient block coordinate descent algorithm,
which, as we empirically show, produces high quality solutions, and, in a
special case, we are able to solve the proposed formulation exactly in linear
time using dynamic programming. We empirically evaluate the proposed approach
both on synthetic datasets and on real-world search query data. We show that
the proposed approach outperforms existing approaches by one to two orders of
magnitude in terms of its average (per element) estimation error and by 45-90%
in terms of its expected magnitude of estimation error.
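The paper's exact formulation is a mixed-integer linear program; as a rough illustration of the underlying idea only (use an observed stream prefix to give the heaviest elements dedicated buckets and hash everything else randomly), here is a minimal Python sketch. The class name, bucket sizes, and the most-common-element heuristic are illustrative assumptions, not the paper's method.

```python
import hashlib
from collections import Counter

class PrefixSketch:
    """Toy sketch: the heaviest elements of an observed prefix get dedicated
    buckets; all other elements are hashed into shared buckets. This mimics
    only the high-level idea; the paper instead chooses the assignment by
    solving a mixed-integer linear optimization problem and hashes unseen
    elements with a learned model."""

    def __init__(self, prefix, n_unique, n_shared):
        heavy = [x for x, _ in Counter(prefix).most_common(n_unique)]
        self.unique = {x: i for i, x in enumerate(heavy)}
        self.n_shared = n_shared
        self.counts = [0] * (len(self.unique) + n_shared)

    def _bucket(self, x):
        if x in self.unique:                      # collision-free bucket
            return self.unique[x]
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16)
        return len(self.unique) + h % self.n_shared   # random shared bucket

    def update(self, x):
        self.counts[self._bucket(x)] += 1

    def estimate(self, x):                        # overestimates on collision
        return self.counts[self._bucket(x)]

stream = ["a"] * 50 + ["b"] * 30 + list("cdefg") * 2
sk = PrefixSketch(prefix=stream[:60], n_unique=2, n_shared=4)
for x in stream:
    sk.update(x)
print(sk.estimate("a"), sk.estimate("c"))   # "a" exact; "c" may share a bucket
```

Dedicated buckets make the heavy elements' counts exact, so collision error is confined to the light tail; that is the intuition the optimization formalizes.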
Related papers
- Building Conformal Prediction Intervals with Approximate Message Passing [14.951392270119461]
Conformal prediction is a powerful tool for building prediction intervals that are valid in a distribution-free way.
We propose a novel algorithm based on Approximate Message Passing (AMP) to accelerate the computation of prediction intervals.
We show that our method produces prediction intervals that are close to the baseline methods, while being orders of magnitude faster.
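For context, a minimal split conformal baseline in Python (plain least squares, no AMP acceleration; the synthetic data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X @ rng.normal(size=5) + rng.normal(size=2000)

# Split the data: fit on one half, calibrate residuals on the other half.
Xtr, Xcal, ytr, ycal = X[:1000], X[1000:], y[:1000], y[1000:]
w, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)

alpha = 0.1                                            # target 90% coverage
scores = np.abs(ycal - Xcal @ w)                       # conformity scores
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))      # finite-sample quantile
qhat = np.sort(scores)[k - 1]

x_new = rng.normal(size=5)
pred = x_new @ w
print(f"90% interval: [{pred - qhat:.2f}, {pred + qhat:.2f}]")
```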
arXiv Detail & Related papers (2024-10-21T20:34:33Z)
- Distributed Markov Chain Monte Carlo Sampling based on the Alternating Direction Method of Multipliers [143.6249073384419]
In this paper, we propose a distributed sampling scheme based on the alternating direction method of multipliers.
We provide both theoretical guarantees of our algorithm's convergence and experimental evidence of its superiority to the state-of-the-art.
In simulation, we deploy our algorithm on linear and logistic regression tasks and illustrate its fast convergence compared to existing gradient-based methods.
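As background for the splitting the sampler builds on, a deterministic consensus ADMM skeleton for distributed ridge regression (the paper's contribution, the sampling scheme itself, is not shown; problem sizes and penalty parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, d = 4, 50, 3                         # workers, samples per worker, dims
w_true = rng.normal(size=d)
A = [rng.normal(size=(n, d)) for _ in range(K)]
b = [Ak @ w_true + 0.1 * rng.normal(size=n) for Ak in A]

rho, lam = 1.0, 0.1
x = [np.zeros(d) for _ in range(K)]        # local variables
u = [np.zeros(d) for _ in range(K)]        # scaled dual variables
z = np.zeros(d)                            # global consensus variable

for _ in range(100):
    for k in range(K):                     # local solves, run in parallel
        x[k] = np.linalg.solve(A[k].T @ A[k] + rho * np.eye(d),
                               A[k].T @ b[k] + rho * (z - u[k]))
    z = rho * sum(xk + uk for xk, uk in zip(x, u)) / (2 * lam + K * rho)
    for k in range(K):
        u[k] += x[k] - z                   # dual ascent step

print(np.round(z, 2), np.round(w_true, 2))
```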
arXiv Detail & Related papers (2024-01-29T02:08:40Z)
- Stochastic optimization with arbitrary recurrent data sampling [2.1485350418225244]
Most commonly used data sampling algorithms are recurrent under mild assumptions.
We show that for a particular class of stochastic optimization algorithms, recurrence is the only property of the data sampling algorithm needed to guarantee convergence.
We also show that convergence can be accelerated by selecting sampling algorithms that cover the data set more effectively; see the sketch below.
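To make "recurrent" concrete, a small SGD sketch with two samplers that both revisit every index, one i.i.d. and one with a bounded cover time (random reshuffling); the learning rate and step count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

def sgd(index_stream, steps=5000, lr=0.01):
    w = np.zeros(d)
    for _, i in zip(range(steps), index_stream):
        g = (X[i] @ w - y[i]) * X[i]       # single-sample gradient
        w -= lr * g
    return w

def iid_stream():                          # recurrent in expectation
    while True:
        yield rng.integers(n)

def shuffled_epochs():                     # covers the data set every n steps
    while True:
        yield from rng.permutation(n)

print(np.round(sgd(iid_stream()), 2))
print(np.round(sgd(shuffled_epochs()), 2))
```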
arXiv Detail & Related papers (2024-01-15T14:04:50Z)
- Learning Unnormalized Statistical Models via Compositional Optimization [73.30514599338407]
Noise-contrastive estimation (NCE) has been proposed by formulating the objective as the logistic loss of classifying real data against artificial noise.
In this paper, we study a direct approach to optimizing the negative log-likelihood of unnormalized models.
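A minimal NCE sketch on a toy 1-d Gaussian, assuming a uniform noise distribution and equal numbers of real and noise samples (the grid search and parameter ranges are illustrative); note how the learned constant c recovers the log-partition function:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=5000)        # real samples from N(2, 1)
noise = rng.uniform(-4.0, 8.0, size=5000)     # artificial noise samples
log_q = -np.log(12.0)                         # log-density of Uniform(-4, 8)

def nce_loss(mu, c):
    """Logistic loss for classifying real vs. noise under the unnormalized
    model log p(x) = -0.5 * (x - mu)**2 - c, with c a learned normalizer."""
    logit = lambda x: (-0.5 * (x - mu) ** 2 - c) - log_q    # log p - log q
    return (np.mean(np.logaddexp(0.0, -logit(data)))        # -log sigmoid
            + np.mean(np.logaddexp(0.0, logit(noise))))     # -log(1 - sigmoid)

# A crude grid search recovers the mean and the true log-partition function.
grid = [(mu, c) for mu in np.linspace(0, 4, 41) for c in np.linspace(0, 2, 41)]
mu_hat, c_hat = min(grid, key=lambda p: nce_loss(*p))
print(mu_hat, c_hat, 0.5 * np.log(2 * np.pi))  # c_hat should be near 0.919
```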
arXiv Detail & Related papers (2023-06-13T01:18:16Z)
- Low-rank extended Kalman filtering for online learning of neural networks from streaming data [71.97861600347959]
We propose an efficient online approximate Bayesian inference algorithm for estimating the parameters of a nonlinear function from a potentially non-stationary data stream.
The method is based on the extended Kalman filter (EKF), but uses a novel low-rank plus diagonal decomposition of the posterior precision matrix.
In contrast to methods based on variational inference, our method is fully deterministic, and does not require step-size tuning.
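A simplified sketch of EKF-style online learning, using a purely diagonal posterior covariance on streaming logistic regression (the paper's method instead keeps a low-rank plus diagonal factor; dimensions and the process-noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_true = rng.normal(size=d)

theta = np.zeros(d)        # posterior mean
P = np.ones(d)             # diagonal posterior covariance (a crude stand-in)
q = 1e-4                   # small process noise, allows non-stationarity

for t in range(5000):
    x = rng.normal(size=d)
    y = float(rng.random() < 1 / (1 + np.exp(-x @ theta_true)))
    P += q                                    # predict: random-walk dynamics
    p = 1 / (1 + np.exp(-x @ theta))          # predicted probability
    H = p * (1 - p) * x                       # Jacobian of observation model
    S = H @ (P * H) + p * (1 - p)             # innovation variance + obs noise
    K = P * H / S                             # Kalman gain
    theta += K * (y - p)                      # update posterior mean
    P = np.maximum(P - K * H * P, 1e-8)       # update diagonal covariance

print(np.round(theta, 2), np.round(theta_true, 2))
```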
arXiv Detail & Related papers (2023-05-31T03:48:49Z)
- Optimal Algorithms for the Inhomogeneous Spiked Wigner Model [89.1371983413931]
We derive an approximate message-passing algorithm (AMP) for the inhomogeneous problem.
We identify in particular the existence of a statistical-to-computational gap, where known algorithms require a signal-to-noise ratio larger than the information-theoretic threshold in order to perform better than random.
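A minimal AMP iteration for the homogeneous rank-one +-1 spiked Wigner model, with a fixed tanh denoiser and its matching Onsager correction (the paper's inhomogeneous variant adapts the denoiser to the block structure; the scaling conventions here are one common textbook choice, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 2000, 2.0                       # problem size, signal-to-noise ratio
v = rng.choice([-1.0, 1.0], size=n)      # hidden +-1 spike
W = rng.normal(size=(n, n)) / np.sqrt(n)
W = (W + W.T) / np.sqrt(2)               # Wigner noise, N(0, 1/n) entries
Y = (lam / n) * np.outer(v, v) + W

x = rng.normal(size=n)                   # random initialization
m_prev = np.zeros(n)
for _ in range(30):
    m = np.tanh(lam * x)                 # denoiser for the +-1 prior
    b = lam * np.mean(1.0 - m**2)        # Onsager correction term
    x, m_prev = Y @ m - b * m_prev, m    # AMP update

overlap = abs(np.dot(np.tanh(lam * x), v)) / n
print(f"overlap with planted spike: {overlap:.2f}")  # nontrivial above lam = 1
```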
arXiv Detail & Related papers (2023-02-13T19:57:17Z)
- Learning to Hash Robustly, with Guarantees [79.68057056103014]
In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching those of theoretical algorithms.
We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically.
Our algorithm achieves 1.8x and 2.1x better recall on the worst-performing queries for the MNIST and ImageNet datasets, respectively.
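For contrast with the data-dependent algorithm, the classic data-oblivious baseline, bit-sampling LSH for Hamming space, fits in a few lines of Python (table and bit counts are illustrative; this is the kind of random scheme whose worst-case behavior the paper improves on):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 10000, 128
db = rng.integers(0, 2, size=(n, dim), dtype=np.uint8)   # binary database

# Each table hashes points by a random subset of k coordinates;
# nearby points (small Hamming distance) collide with high probability.
n_tables, k = 10, 16
tables = []
for _ in range(n_tables):
    bits = rng.choice(dim, size=k, replace=False)
    buckets = {}
    for i, p in enumerate(db):
        buckets.setdefault(p[bits].tobytes(), []).append(i)
    tables.append((bits, buckets))

def query(q):
    cand = set()
    for bits, buckets in tables:
        cand.update(buckets.get(q[bits].tobytes(), []))
    if not cand:
        return None
    cand = list(cand)
    dists = (db[cand] != q).sum(axis=1)    # exact distance on candidates only
    return cand[int(np.argmin(dists))]

q = db[42].copy()
q[:5] ^= 1                       # flip 5 bits: a near neighbor of point 42
print(query(q))                  # very likely returns 42
```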
arXiv Detail & Related papers (2021-08-11T20:21:30Z)
- Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
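A compact sketch of the randomized recipe: row-sketch the matrix, take a QR factorization of the small sketch, and read approximate leverage scores off the preconditioned rows (the Gaussian sketch and sizes are illustrative; the paper develops rank-revealing variants that also handle rank deficiency):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))

# Exact scores: squared row norms of an orthonormal basis for range(A).
Q, _ = np.linalg.qr(A, mode="reduced")
lev_exact = np.sum(Q**2, axis=1)

# Randomized estimate: sketch the rows, factorize the small sketch, then
# read the scores off the rows of A preconditioned by the R factor.
r = 200                                        # sketch size (>> 20 columns)
S = rng.normal(size=(r, 1000)) / np.sqrt(r)    # Gaussian sketching map
_, R = np.linalg.qr(S @ A, mode="reduced")
lev_approx = np.sum((A @ np.linalg.inv(R)) ** 2, axis=1)

rel_err = np.max(np.abs(lev_approx - lev_exact) / lev_exact)
print(f"max relative error: {rel_err:.2f}")    # rough multiplicative accuracy
```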
arXiv Detail & Related papers (2021-05-23T19:21:55Z)
- Matrix completion based on Gaussian belief propagation [5.685589351789462]
We develop a message-passing algorithm for noisy matrix completion problems based on matrix factorization.
We derive a memory-friendly version of the proposed algorithm by applying a perturbation treatment commonly used in the literature of approximate message passing.
Experiments on synthetic datasets show that while the proposed algorithm quantitatively exhibits almost the same performance under settings where the earlier algorithm is optimal, it is advantageous when the observed datasets are corrupted by non-Gaussian noise.
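The message-passing derivation is involved; as a reference point for the same factorization-based completion setup, here is a plain alternating least squares baseline (not the paper's algorithm; rank, noise level, and observation rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 100, 80, 3
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, m))   # low-rank ground truth
mask = rng.random(size=(n, m)) < 0.3                    # 30% observed entries
Y = np.where(mask, M + 0.01 * rng.normal(size=(n, m)), 0.0)

U = rng.normal(size=(n, r))
V = rng.normal(size=(m, r))
lam = 0.1                                  # ridge term keeps solves well-posed
for _ in range(30):
    for i in range(n):                     # update each row of U
        Vi = V[mask[i]]
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(r),
                               Vi.T @ Y[i, mask[i]])
    for j in range(m):                     # update each row of V
        Uj = U[mask[:, j]]
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(r),
                               Uj.T @ Y[mask[:, j], j])

err = np.linalg.norm(U @ V.T - M) / np.linalg.norm(M)
print(f"relative reconstruction error: {err:.3f}")
```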
arXiv Detail & Related papers (2021-05-01T12:16:49Z)
- Sparse Algorithms for Markovian Gaussian Processes [18.999495374836584]
Sparse Markovian Gaussian processes combine the use of inducing variables with efficient Kalman filter-like recursions.
We derive a general site-based approach to approximate the non-Gaussian likelihood with local Gaussian terms, called sites.
Our approach results in a suite of novel sparse extensions to algorithms from both the machine learning and signal processing literatures, including variational inference, expectation propagation, and the classical nonlinear Kalman smoothers.
The derived methods are suited to spatio-temporal data, where the model has separate inducing points in both time and space.
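The Kalman-filter view is easy to state in code: a Matern-1/2 (Ornstein-Uhlenbeck) GP reduces to a one-dimensional state-space model, so filtering runs in time linear in the number of points. This is the vanilla Markovian recursion, without the paper's inducing points or sites; hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
ell, sig2, noise = 0.5, 1.0, 0.1            # lengthscale, variance, obs noise
t = np.sort(rng.uniform(0, 10, size=200))
f = np.sin(t)                               # latent function to recover
y = f + np.sqrt(noise) * rng.normal(size=len(t))

m, P = 0.0, sig2                            # prior state mean and variance
means = []
for k in range(len(t)):
    if k > 0:                               # predict: OU transition
        a = np.exp(-(t[k] - t[k - 1]) / ell)
        m, P = a * m, a * a * P + sig2 * (1 - a * a)
    S = P + noise                           # innovation variance
    K = P / S                               # Kalman gain
    m, P = m + K * (y[k] - m), (1 - K) * P  # update with observation y_k
    means.append(m)

rmse = np.sqrt(np.mean((np.array(means) - f) ** 2))
print(f"filter RMSE: {rmse:.3f}")           # below the observation noise level
```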
arXiv Detail & Related papers (2021-03-19T09:50:53Z)