Generalized Data Thinning Using Sufficient Statistics
- URL: http://arxiv.org/abs/2303.12931v2
- Date: Sun, 11 Jun 2023 22:32:04 GMT
- Title: Generalized Data Thinning Using Sufficient Statistics
- Authors: Ameer Dharamshi, Anna Neufeld, Keshav Motwani, Lucy L. Gao, Daniela
Witten, Jacob Bien
- Abstract summary: A recent paper showed that for some well-known natural exponential families, $X$ can be "thinned" into independent random variables $X^{(1)}, \ldots, X^{(K)}$, such that $X = \sum_{k=1}^K X^{(k)}$.
These independent random variables can then be used for various model validation and inference tasks, including in contexts where traditional sample splitting fails.
In this paper, we generalize their procedure by relaxing this summation requirement and simply asking that some known function of the independent random variables exactly reconstruct $X$.
- Score: 2.3488056916440856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our goal is to develop a general strategy to decompose a random variable $X$
into multiple independent random variables, without sacrificing any information
about unknown parameters. A recent paper showed that for some well-known
natural exponential families, $X$ can be "thinned" into independent random
variables $X^{(1)}, \ldots, X^{(K)}$, such that $X = \sum_{k=1}^K X^{(k)}$.
These independent random variables can then be used for various model
validation and inference tasks, including in contexts where traditional sample
splitting fails. In this paper, we generalize their procedure by relaxing this
summation requirement and simply asking that some known function of the
independent random variables exactly reconstruct $X$. This generalization of
the procedure serves two purposes. First, it greatly expands the families of
distributions for which thinning can be performed. Second, it unifies sample
splitting and data thinning, which on the surface seem to be very different, as
applications of the same principle. This shared principle is sufficiency. We
use this insight to perform generalized thinning operations for a diverse set
of families.
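To make the summation case concrete, the sketch below illustrates the classical Poisson instance of data thinning that the abstract refers to: conditional on $X = x$, the $x$ counts are allocated multinomially across $K$ folds, which yields independent $X^{(k)} \sim \text{Poisson}(\lambda / K)$ with $X = \sum_{k=1}^K X^{(k)}$. This is a minimal illustration of that well-known construction, not code from the paper; the names (`poisson_thin`, `lam`, `K`) are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_thin(x, K, rng):
    """Split one Poisson count x into K independent Poisson pieces that sum to x."""
    # Conditional on X = x, allocate each of the x counts uniformly at random
    # among the K folds; marginally each fold is Poisson(lam / K), and the
    # folds are mutually independent.
    return rng.multinomial(x, np.full(K, 1.0 / K))

lam, K, n = 10.0, 3, 100_000
X = rng.poisson(lam, size=n)
folds = np.array([poisson_thin(x, K, rng) for x in X])   # shape (n, K)

assert np.all(folds.sum(axis=1) == X)                # exact reconstruction: X = sum_k X^(k)
print(folds.mean(axis=0))                            # each column's mean is close to lam / K
print(np.corrcoef(folds[:, 0], folds[:, 1])[0, 1])   # near-zero correlation between folds
```

The generalization described in the abstract replaces the sum with any known function $T$ of the folds, provided the conditional distribution of the folds given $T$ does not depend on the unknown parameter; that requirement is exactly the sufficiency principle that unifies data thinning with sample splitting.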
Related papers
- Squared families: Searching beyond regular probability models [22.68738495315807]
Squared families are families of probability densities obtained by squaring a linear transformation of a statistic.
Their Fisher information is a conformal transformation of the Hessian metric induced from a Bregman generator.
The squared family kernel is the only integral that needs to be computed for the Fisher information, statistical divergence and normalising constant.
arXiv Detail & Related papers (2025-03-27T03:39:35Z) - Near-Optimal Mean Estimation with Unknown, Heteroskedastic Variances [15.990720051907864]
The Subset-of-Signals model serves as a benchmark for heteroskedastic mean estimation.
Our algorithm resolves this open question up to logarithmic factors.
Even for $d=2$, our techniques enable rates comparable to knowing the variance of each sample.
arXiv Detail & Related papers (2023-12-05T01:13:10Z) - A Robustness Analysis of Blind Source Separation [91.3755431537592]
Blind source separation (BSS) aims to recover an unobserved signal from its mixture $X=f(S)$ under the condition that the transformation $f$ is invertible but unknown.
We present a general framework for analysing such violations and quantifying their impact on the blind recovery of $S$ from $X$.
We show that a generic BSS-solution in response to general deviations from its defining structural assumptions can be profitably analysed in the form of explicit continuity guarantees.
arXiv Detail & Related papers (2023-03-17T16:30:51Z) - Universality laws for Gaussian mixtures in generalized linear models [22.154969876570238]
We investigate the joint statistics of the family of generalized linear estimators $(\Theta_1, \dots, \Theta_M)$.
This allows us to prove the universality of different quantities of interest, such as the training and generalization errors.
We discuss the applications of our results to different machine learning tasks of interest, such as ensembling and uncertainty quantification.
arXiv Detail & Related papers (2023-02-17T15:16:06Z) - Learning and Covering Sums of Independent Random Variables with
Unbounded Support [4.458210211781738]
We study the problem of covering and learning sums $X = X_1 + \cdots + X_n$ of independent integer-valued random variables $X_i$ with unbounded, or even infinite, support.
We show that the maximum value of the collective support of $X_i$'s necessarily appears in the sample complexity of learning $X$.
arXiv Detail & Related papers (2022-10-24T15:03:55Z) - Revealing Unobservables by Deep Learning: Generative Element Extraction
Networks (GEEN) [5.3028918247347585]
This paper proposes a novel method for estimating realizations of a latent variable $X^*$ in a random sample.
To the best of our knowledge, this paper is the first to provide such identification from observations.
arXiv Detail & Related papers (2022-10-04T01:09:05Z) - $p$-Generalized Probit Regression and Scalable Maximum Likelihood
Estimation via Sketching and Coresets [74.37849422071206]
We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses.
We show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently up to a factor of $(1+\varepsilon)$ on large data.
arXiv Detail & Related papers (2022-03-25T10:54:41Z) - Flexible mean field variational inference using mixtures of
non-overlapping exponential families [6.599344783327053]
I show that using standard mean field variational inference can fail to produce sensible results for models with sparsity-inducing priors.
I show that any mixture of a diffuse exponential family and a point mass at zero (used to model sparsity) forms an exponential family.
arXiv Detail & Related papers (2020-10-14T01:46:56Z) - Contextuality scenarios arising from networks of stochastic processes [68.8204255655161]
An empirical model is said to be contextual if its distributions cannot be obtained by marginalizing a joint distribution over $X$.
We present a different and classical source of contextual empirical models: the interaction among many processes.
The statistical behavior of the network in the long run makes the empirical model generically contextual and even strongly contextual.
arXiv Detail & Related papers (2020-06-22T16:57:52Z) - Locally Private Hypothesis Selection [96.06118559817057]
We output a distribution from $\mathcal{Q}$ whose total variation distance to $p$ is comparable to the best such distribution.
We show that the constraint of local differential privacy incurs an exponential increase in cost.
Our algorithms result in exponential improvements on the round complexity of previous methods.
arXiv Detail & Related papers (2020-02-21T18:30:48Z) - Neural Bayes: A Generic Parameterization Method for Unsupervised
Representation Learning [175.34232468746245]
We introduce a parameterization method called Neural Bayes.
It allows computing statistical quantities that are in general difficult to compute.
We show two independent use cases for this parameterization.
arXiv Detail & Related papers (2020-02-20T22:28:53Z) - Algebraic and Analytic Approaches for Parameter Learning in Mixture
Models [66.96778152993858]
We present two different approaches for parameter learning in several mixture models in one dimension.
For some of these distributions, our results represent the first guarantees for parameter estimation.
arXiv Detail & Related papers (2020-01-19T05:10:56Z)