Bayesian nonparametric estimation of coverage probabilities and distinct
counts from sketched data
- URL: http://arxiv.org/abs/2209.02135v1
- Date: Mon, 5 Sep 2022 20:48:04 GMT
- Title: Bayesian nonparametric estimation of coverage probabilities and distinct
counts from sketched data
- Authors: Stefano Favaro, Matteo Sesia
- Abstract summary: We propose a nonparametric methodology to estimate coverage probabilities from data sketched through random hashing.
The proposed Bayesian estimators are shown to be easily applicable to large-scale analyses in combination with a Dirichlet process prior.
The empirical effectiveness of our methodology is demonstrated through numerical experiments and applications to real data sets of Covid DNA sequences, classic English literature, and IP addresses.
- Score: 6.510507449705344
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The estimation of coverage probabilities, and in particular of the missing
mass, is a classical statistical problem with applications in numerous
scientific fields. In this paper, we study this problem in relation to
randomized data compression, or sketching. This is a novel but practically
relevant perspective, and it refers to situations in which coverage
probabilities must be estimated based on a compressed and imperfect summary, or
sketch, of the true data, because neither the full data nor the empirical
frequencies of distinct symbols can be observed directly. Our contribution is a
Bayesian nonparametric methodology to estimate coverage probabilities from data
sketched through random hashing, which also solves the challenging problems of
recovering the numbers of distinct counts in the true data and of distinct
counts with a specified empirical frequency of interest. The proposed Bayesian
estimators are shown to be easily applicable to large-scale analyses in
combination with a Dirichlet process prior, although they involve some open
computational challenges under the more general Pitman-Yor process prior. The
empirical effectiveness of our methodology is demonstrated through numerical
experiments and applications to real data sets of Covid DNA sequences, classic
English literature, and IP addresses.
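As a rough illustration of the setting described in the abstract, the Python sketch below hashes a stream of symbols into a small number of buckets and keeps only the bucket counts. It then compares what the full data would allow: the classical Good-Turing missing-mass estimate, and the posterior expected missing mass under a Dirichlet process prior with fully observed data, which is theta / (theta + n). This is not the authors' estimator. The function names, the choice of BLAKE2b as the hash, the toy token stream, and the value theta = 1.0 are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def sketch(tokens, num_buckets, seed=0):
    """Hash each token into one of num_buckets buckets; keep only bucket counts.

    Hypothetical helper: the paper's hashing scheme may differ.
    """
    counts = [0] * num_buckets
    for tok in tokens:
        digest = hashlib.blake2b(
            f"{seed}:{tok}".encode("utf-8"), digest_size=8
        ).digest()
        counts[int.from_bytes(digest, "big") % num_buckets] += 1
    return counts


def good_turing_missing_mass(tokens):
    """Classical Good-Turing estimate: share of the sample made up of singletons.

    Needs the empirical frequencies of distinct symbols, which a sketch hides.
    """
    freqs = Counter(tokens)
    singletons = sum(1 for c in freqs.values() if c == 1)
    return singletons / len(tokens)


def dp_missing_mass(n, theta):
    """Posterior expected missing mass under a DP(theta) prior, full data observed."""
    return theta / (theta + n)


# Toy data (an assumption, not from the paper).
tokens = "to be or not to be that is the question".split()

bucket_counts = sketch(tokens, num_buckets=4)
num_distinct = len(set(tokens))
nonempty_buckets = sum(1 for c in bucket_counts if c > 0)

print("bucket counts:", bucket_counts)
print("distinct symbols (hidden by the sketch):", num_distinct)
print("non-empty buckets:", nonempty_buckets)
print("Good-Turing missing mass (full data):", good_turing_missing_mass(tokens))
print("DP posterior missing mass (full data):", dp_missing_mass(len(tokens), 1.0))
```

With eight distinct symbols and only four buckets, collisions are unavoidable, so the bucket counts alone reveal neither the number of distinct symbols nor how many symbols occur exactly once. Recovering those quantities from the sketch is the problem the paper addresses.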
Related papers
- Bayesian Semi-supervised Inference via a Debiased Modeling Approach [1.2833734915643464]
Inference in semi-supervised (SS) settings has gained substantial attention in recent years due to increased relevance in modern big-data problems.
We propose a novel Bayesian method for estimating the population mean in SS settings.
arXiv Detail & Related papers (2025-09-22T06:49:10Z) - Approximating Counterfactual Bounds while Fusing Observational, Biased
and Randomised Data Sources [64.96984404868411]
We address the problem of integrating data from multiple, possibly biased, observational and interventional studies.
We show that the likelihood of the available data has no local maxima.
We then show how the same approach can address the general case of multiple datasets.
arXiv Detail & Related papers (2023-07-31T11:28:24Z) - The Decaying Missing-at-Random Framework: Doubly Robust Causal Inference
with Partially Labeled Data [10.021381302215062]
In real-world scenarios, data collection limitations often result in partially labeled datasets, leading to difficulties in drawing reliable causal inferences.
Traditional approaches in the semi-supervised (SS) and missing data literature may not adequately handle these complexities, leading to biased estimates.
This framework tackles missing outcomes in high-dimensional settings and accounts for selection bias.
arXiv Detail & Related papers (2023-05-22T07:37:12Z) - Learning to Bound Counterfactual Inference in Structural Causal Models
from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z) - MissDAG: Causal Discovery in the Presence of Missing Data with
Continuous Additive Noise Models [78.72682320019737]
We develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations.
MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization framework.
We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.
arXiv Detail & Related papers (2022-05-27T09:59:46Z) - Combining Observational and Randomized Data for Estimating Heterogeneous
Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z) - Deep Probability Estimation [14.659180336823354]
We investigate probability estimation from high-dimensional data using deep neural networks.
We evaluate existing methods on the synthetic data as well as on three real-world probability estimation tasks.
arXiv Detail & Related papers (2021-11-21T03:55:50Z) - DPER: Efficient Parameter Estimation for Randomly Missing Data [0.24466725954625884]
We propose novel algorithms to find the maximum likelihood estimates (MLEs) for a one-class/multiple-class randomly missing data set.
Our algorithms do not require multiple iterations through the data, thus promising to be less time-consuming than other methods.
arXiv Detail & Related papers (2021-06-06T16:37:48Z) - Tracking disease outbreaks from sparse data with Bayesian inference [55.82986443159948]
The COVID-19 pandemic provides new motivation for estimating the empirical rate of transmission during an outbreak.
Standard methods struggle to accommodate the partial observability and sparse data common at finer scales.
We propose a Bayesian framework which accommodates partial observability in a principled manner.
arXiv Detail & Related papers (2020-09-12T20:37:33Z) - Anomaly Detection in Trajectory Data with Normalizing Flows [0.0]
We propose an approach based on normalizing flows that enables complex density estimation from data with neural networks.
Our proposal computes exact model likelihood values, an important feature of normalizing flows, for each segment of the trajectory.
We evaluate our methodology, named aggregated anomaly detection with normalizing flows (GRADINGS), using real world trajectory data and compare it with more traditional anomaly detection techniques.
arXiv Detail & Related papers (2020-04-13T14:16:40Z) - A Robust Functional EM Algorithm for Incomplete Panel Count Data [66.07942227228014]
We propose a functional EM algorithm to estimate the counting process mean function under a missing completely at random (MCAR) assumption.
The proposed algorithm wraps several popular panel count inference methods, seamlessly deals with incomplete counts and is robust to misspecification of the Poisson process assumption.
We illustrate the utility of the proposed algorithm through numerical experiments and an analysis of smoking cessation data.
arXiv Detail & Related papers (2020-03-02T20:04:38Z)