Related papers: Bayesian nonparametric estimation of coverage probabilities and distinct counts from sketched data

URL: http://arxiv.org/abs/2209.02135v1
Date: Mon, 5 Sep 2022 20:48:04 GMT
Title: Bayesian nonparametric estimation of coverage probabilities and distinct counts from sketched data
Authors: Stefano Favaro, Matteo Sesia
Abstract summary: We propose a nonparametric methodology to estimate coverage probabilities from data sketched through random hashing. The proposed Bayesian estimators are shown to be easily applicable to large-scale analyses in combination with a Dirichlet process prior. The empirical effectiveness of our methodology is demonstrated through numerical experiments and applications to real data sets of Covid DNA sequences, classic English literature, and IP addresses.
Score: 6.510507449705344
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem with applications in numerous scientific fields. In this paper, we study this problem in relation to randomized data compression, or sketching. This is a novel but practically relevant perspective, and it refers to situations in which coverage probabilities must be estimated based on a compressed and imperfect summary, or sketch, of the true data, because neither the full data nor the empirical frequencies of distinct symbols can be observed directly. Our contribution is a Bayesian nonparametric methodology to estimate coverage probabilities from data sketched through random hashing, which also solves the challenging problems of recovering the numbers of distinct counts in the true data and of distinct counts with a specified empirical frequency of interest. The proposed Bayesian estimators are shown to be easily applicable to large-scale analyses in combination with a Dirichlet process prior, although they involve some open computational challenges under the more general Pitman-Yor process prior. The empirical effectiveness of our methodology is demonstrated through numerical experiments and applications to real data sets of Covid DNA sequences, classic English literature, and IP addresses.
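As a rough illustration of the setting described in the abstract, the sketch below compresses a stream of symbols into bucket counts through random hashing and, for contrast, computes the classical Good-Turing missing-mass estimate (the number of symbols seen exactly once, divided by the sample size) from the full data. It is not the paper's Bayesian nonparametric estimator. The function names, the SHA-256-based hash, and the toy data are assumptions made for this example only.

```python
import hashlib
from collections import Counter


def hash_bucket(symbol: str, num_buckets: int, seed: int = 0) -> int:
    """Map a symbol to a bucket index with a seeded hash (illustrative choice)."""
    digest = hashlib.sha256(f"{seed}:{symbol}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def sketch(data, num_buckets: int, seed: int = 0):
    """Compress the data into per-bucket counts; symbol identities are lost."""
    counts = [0] * num_buckets
    for symbol in data:
        counts[hash_bucket(symbol, num_buckets, seed)] += 1
    return counts


def good_turing_missing_mass(data) -> float:
    """Classical Good-Turing estimate: (# symbols seen exactly once) / n.

    This needs the empirical frequencies of distinct symbols, which is
    exactly what is unavailable when only the sketch is observed.
    """
    frequencies = Counter(data)
    singletons = sum(1 for count in frequencies.values() if count == 1)
    return singletons / len(data)


# Hypothetical toy data: "a" appears three times, the rest once each.
data = ["a", "b", "a", "c", "a", "d"]
bucket_counts = sketch(data, num_buckets=4)
missing_mass = good_turing_missing_mass(data)
```

Because distinct symbols can collide in the same bucket, the bucket counts only bound the true frequencies from above. That loss of information is why estimating coverage probabilities from the sketch alone calls for a model-based approach such as the one the abstract proposes.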

Related papers

Bayesian Semi-supervised Inference via a Debiased Modeling Approach [1.2833734915643464]
Inference in semi-supervised (SS) settings has gained substantial attention in recent years due to increased relevance in modern big-data problems. We propose a novel Bayesian method for estimating the population mean in SS settings.
arXiv Detail & Related papers (2025-09-22T06:49:10Z)
Approximating Counterfactual Bounds while Fusing Observational, Biased and Randomised Data Sources [64.96984404868411]
We address the problem of integrating data from multiple, possibly biased, observational and interventional studies. We show that the likelihood of the available data has no local maxima. We then show how the same approach can address the general case of multiple datasets.
arXiv Detail & Related papers (2023-07-31T11:28:24Z)
The Decaying Missing-at-Random Framework: Doubly Robust Causal Inference with Partially Labeled Data [10.021381302215062]
In real-world scenarios, data collection limitations often result in partially labeled datasets, leading to difficulties in drawing reliable causal inferences. Traditional approaches in the semi-supervised (SS) and missing data literature may not adequately handle these complexities, leading to biased estimates. This framework tackles missing outcomes in high-dimensional settings and accounts for selection bias.
arXiv Detail & Related papers (2023-05-22T07:37:12Z)
Learning to Bound Counterfactual Inference in Structural Causal Models from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm. The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources. It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models [78.72682320019737]
We develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations. MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization framework. We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.
arXiv Detail & Related papers (2022-05-27T09:59:46Z)
Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains. Currently, most existing works rely exclusively on observational data. We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z)
Deep Probability Estimation [14.659180336823354]
We investigate probability estimation from high-dimensional data using deep neural networks. We evaluate existing methods on synthetic data as well as on three real-world probability estimation tasks.
arXiv Detail & Related papers (2021-11-21T03:55:50Z)
DPER: Efficient Parameter Estimation for Randomly Missing Data [0.24466725954625884]
We propose novel algorithms to find the maximum likelihood estimates (MLEs) for a one-class/multiple-class randomly missing data set. Our algorithms do not require multiple iterations through the data, thus promising to be less time-consuming than other methods.
arXiv Detail & Related papers (2021-06-06T16:37:48Z)
Tracking disease outbreaks from sparse data with Bayesian inference [55.82986443159948]
The COVID-19 pandemic provides new motivation for estimating the empirical rate of transmission during an outbreak. Standard methods struggle to accommodate the partial observability and sparse data common at finer scales. We propose a Bayesian framework which accommodates partial observability in a principled manner.
arXiv Detail & Related papers (2020-09-12T20:37:33Z)
Anomaly Detection in Trajectory Data with Normalizing Flows [0.0]
We propose an approach based on normalizing flows that enables complex density estimation from data with neural networks. Our proposal computes exact model likelihood values, an important feature of normalizing flows, for each segment of the trajectory. We evaluate our methodology, named aggregated anomaly detection with normalizing flows (GRADINGS), using real-world trajectory data and compare it with more traditional anomaly detection techniques.
arXiv Detail & Related papers (2020-04-13T14:16:40Z)
A Robust Functional EM Algorithm for Incomplete Panel Count Data [66.07942227228014]
We propose a functional EM algorithm to estimate the counting process mean function under a missing completely at random (MCAR) assumption. The proposed algorithm wraps several popular panel count inference methods, seamlessly deals with incomplete counts, and is robust to misspecification of the Poisson process assumption. We illustrate the utility of the proposed algorithm through numerical experiments and an analysis of smoking cessation data.
arXiv Detail & Related papers (2020-03-02T20:04:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.