A smoothed-Bayesian approach to frequency recovery from sketched data
- URL: http://arxiv.org/abs/2309.15408v3
- Date: Thu, 28 Nov 2024 13:46:38 GMT
- Title: A smoothed-Bayesian approach to frequency recovery from sketched data
- Authors: Mario Beraha, Stefano Favaro, Matteo Sesia,
- Abstract summary: We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory.
We recover the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing.
- Score: 16.16733806935934
- License:
- Abstract: We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this paper, we propose a smoothed-Bayesian method, inspired by existing BNP approaches but designed in a frequentist framework to overcome the computational limitations of the BNP approaches when dealing with large-scale data from realistic distributions, including those with power-law tail behaviors. For sketches obtained with a single hash function, our approach is supported by rigorous frequentist properties, including unbiasedness and optimality under a squared error loss function within an intuitive class of linear estimators. For sketches with multiple hash functions, we introduce an approach based on multi-view learning to construct computationally efficient frequency estimators. We validate our method on synthetic and real data, comparing its performance to that of existing alternatives.
Related papers
- Linear-cost unbiased posterior estimates for crossed effects and matrix factorization models via couplings [0.0]
We design and analyze unbiased Markov chain Monte Carlo schemes based on couplings of blocked Gibbs samplers (BGSs)
Our methodology is designed for and applicable to high-dimensional BGS with conditionally independent blocks.
arXiv Detail & Related papers (2024-10-11T16:05:01Z) - Efficient Fairness-Performance Pareto Front Computation [51.558848491038916]
We show that optimal fair representations possess several useful structural properties.
We then show that these approxing problems can be solved efficiently via concave programming methods.
arXiv Detail & Related papers (2024-09-26T08:46:48Z) - Bayesian Cramér-Rao Bound Estimation with Score-Based Models [3.4480437706804503]
The Bayesian Cram'er-Rao bound (CRB) provides a lower bound on the mean square error of any Bayesian estimator under mild regularity conditions.
This work introduces a new data-driven estimator for the CRB using score matching.
arXiv Detail & Related papers (2023-09-28T00:22:21Z) - Learning Unnormalized Statistical Models via Compositional Optimization [73.30514599338407]
Noise-contrastive estimation(NCE) has been proposed by formulating the objective as the logistic loss of the real data and the artificial noise.
In this paper, we study it a direct approach for optimizing the negative log-likelihood of unnormalized models.
arXiv Detail & Related papers (2023-06-13T01:18:16Z) - Optimal Algorithms for the Inhomogeneous Spiked Wigner Model [89.1371983413931]
We derive an approximate message-passing algorithm (AMP) for the inhomogeneous problem.
We identify in particular the existence of a statistical-to-computational gap where known algorithms require a signal-to-noise ratio bigger than the information-theoretic threshold to perform better than random.
arXiv Detail & Related papers (2023-02-13T19:57:17Z) - Score-based Continuous-time Discrete Diffusion Models [102.65769839899315]
We extend diffusion models to discrete variables by introducing a Markov jump process where the reverse process denoises via a continuous-time Markov chain.
We show that an unbiased estimator can be obtained via simple matching the conditional marginal distributions.
We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
arXiv Detail & Related papers (2022-11-30T05:33:29Z) - Robust leave-one-out cross-validation for high-dimensional Bayesian
models [0.0]
Leave-one-out cross-validation (LOO-CV) is a popular method for estimating out-of-sample predictive accuracy.
Here we propose and analyze a novel mixture estimator to compute LOO-CV criteria.
Our method retains the simplicity and computational convenience of classical approaches, while guaranteeing finite variance of the resulting estimators.
arXiv Detail & Related papers (2022-09-19T17:14:52Z) - Efficient semidefinite-programming-based inference for binary and
multi-class MRFs [83.09715052229782]
We propose an efficient method for computing the partition function or MAP estimate in a pairwise MRF.
We extend semidefinite relaxations from the typical binary MRF to the full multi-class setting, and develop a compact semidefinite relaxation that can again be solved efficiently using the solver.
arXiv Detail & Related papers (2020-12-04T15:36:29Z) - Remaining Useful Life Estimation Under Uncertainty with Causal GraphNets [0.0]
A novel approach for the construction and training of time series models is presented.
The proposed method is appropriate for constructing predictive models for non-stationary time series.
arXiv Detail & Related papers (2020-11-23T21:28:03Z) - Frequency Estimation in Data Streams: Learning the Optimal Hashing
Scheme [3.7565501074323224]
We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning.
The proposed approach exploits an observed stream prefix to near-optimally hash elements and compress the target frequency distribution.
We show that the proposed approach outperforms existing approaches by one to two orders of magnitude in terms of its average (per element) estimation error and by 45-90% in terms of its expected magnitude of estimation error.
arXiv Detail & Related papers (2020-07-17T22:15:22Z) - Learned Factor Graphs for Inference from Stationary Time Sequences [107.63351413549992]
We propose a framework that combines model-based algorithms and data-driven ML tools for stationary time sequences.
neural networks are developed to separately learn specific components of a factor graph describing the distribution of the time sequence.
We present an inference algorithm based on learned stationary factor graphs, which learns to implement the sum-product scheme from labeled data.
arXiv Detail & Related papers (2020-06-05T07:06:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.