Related papers: Data fission: splitting a single data point

Data fission: splitting a single data point

URL: http://arxiv.org/abs/2112.11079v9
Date: Sun, 10 Dec 2023 19:04:32 GMT
Title: Data fission: splitting a single data point
Authors: James Leiner, Boyan Duan, Larry Wasserman, Aaditya Ramdas
Abstract summary: In this paper, we offer a more general methodology for achieving such a split in finite samples. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few applications, such as post-selection inference for trend filtering and other regression problems.
Score: 27.500860533521713
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2022) offers an alternative approach that uses additive Gaussian noise -- this enables post-selection inference in finite samples for Gaussian distributed data and asymptotically when errors are non-Gaussian. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.

Related papers

Sample and Map from a Single Convex Potential: Generation using Conjugate Moment Measures [22.7776491836979]
A common approach to generative modeling is to split model-fitting into two blocks: define first how to sample noise (e.g. Gaussian) and choose next what to do with it. We explore in this alternative route that sampling and mapping. We find inspiration in moment measures, a result that states for any measure $mathbbRd$ there exists a unique potential $u$rho that $rho=nabla u,sharp,e-u$.
arXiv Detail & Related papers (2025-03-13T17:28:44Z)
Computational bottlenecks for denoising diffusions [8.05574597775852]
Denoising diffusions provide a general strategy to sample from a probability distribution $mu$ in $mathbbRd$ by constructing a process $(hatboldsymbol x_t:tge 0)$ in $mathbb Rd$ We show that any-time computable drift can be modified in a way that changes minimally the score matching objective and yet results in incorrect sampling.
arXiv Detail & Related papers (2025-03-11T04:21:01Z)
Outsourced diffusion sampling: Efficient posterior inference in latent spaces of generative models [65.71506381302815]
We propose amortize the cost of sampling from a posterior distribution of the form $p(mathbfxmidmathbfy) propto p_theta(mathbfx)$. For many models and constraints of interest, the posterior in the noise space is smoother than the posterior in the data space, making it more amenable to such amortized inference.
arXiv Detail & Related papers (2025-02-10T19:49:54Z)
Characterizing the Distinguishability of Product Distributions through Multicalibration [9.695176684285832]
We prove a new, tight characterization of the number of samples $k$ needed to efficiently distinguish $X_0otimes k$ and $X_1otimes k$. Our framework can be used to re-derive a result of Halevi and Rabin (TCC 2008) and Geier (TCC 2022)
arXiv Detail & Related papers (2024-12-04T18:56:19Z)
Dimension-free Private Mean Estimation for Anisotropic Distributions [55.86374912608193]
Previous private estimators on distributions over $mathRd suffer from a curse of dimensionality. We present an algorithm whose sample complexity has improved dependence on dimension.
arXiv Detail & Related papers (2024-11-01T17:59:53Z)
Statistical-Computational Trade-offs for Density Estimation [60.81548752871115]
We show that for a broad class of data structures their bounds cannot be significantly improved. This is a novel emphstatistical-computational trade-off for density estimation.
arXiv Detail & Related papers (2024-10-30T15:03:33Z)
Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
conditional distributions is a central problem in machine learning. We propose a new learning paradigm that integrates both paired and unpaired data. Our approach also connects intriguingly with inverse entropic optimal transport (OT)
arXiv Detail & Related papers (2024-10-03T16:12:59Z)
Debiasing and a local analysis for population clustering using semidefinite programming [1.9761774213809036]
We consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of $2$ sub-gaussian distributions. This work is motivated by the application of clustering individuals according to their population of origin.
arXiv Detail & Related papers (2024-01-16T03:14:24Z)
Testing with Non-identically Distributed Samples [20.74768558932617]
We examine the extent to which sublinear-sample property testing and estimation applies to settings where samples are independently but not identically distributed. Even with $Theta(k/varepsilon2)$ samples from each distribution, $textbfp_mathrmavg$ is sufficient to learn $textbfp_mathrmavg$ to within error $varepsilon$ in TV distance.
arXiv Detail & Related papers (2023-11-19T01:25:50Z)
Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner. We also provide experiments on synthetic distributions in 2D using the $k$-means++ implementation from sklearn as a black-box that validate our theoretical results.
arXiv Detail & Related papers (2023-02-20T23:29:43Z)
Tight Bounds on the Hardness of Learning Simple Nonparametric Mixtures [9.053430799456587]
We study the problem of learning nonparametric distributions in a finite mixture. We establish tight bounds on the sample complexity for learning the component distributions in such models.
arXiv Detail & Related papers (2022-03-28T23:53:48Z)
Locally differentially private estimation of nonlinear functionals of discrete distributions [9.028773906859541]
We study the problem of estimating non-linear functionals of discrete distributions in the context of local differential privacy. Only $alpha$-locally differentially private (LDP) samples are publicly available, where the term 'local' means that each $z_i$ is produced using one individual $x_i$. We describe the behavior of the quadratic risk for estimating the power sum functional $F_gamma = sum_k=1K p_kgamma$, $gamma >0$ as a function of $K,, n
arXiv Detail & Related papers (2021-07-08T16:11:10Z)
Inductive Mutual Information Estimation: A Convex Maximum-Entropy Copula Approach [0.5330240017302619]
We propose a novel estimator of the mutual information between two ordinal vectors $x$ and $y$. We prove that, so long as the constraint is feasible, this problem admits a unique solution, it is in the exponential family, and it can be learned by solving a convex optimization problem. We show that our approach may be used to mitigate mode collapse in GANs by maximizing the entropy of the copula of fake samples.
arXiv Detail & Related papers (2021-02-25T21:21:40Z)
A Randomized Algorithm to Reduce the Support of Discrete Measures [79.55586575988292]
Given a discrete probability measure supported on $N$ atoms and a set of $n$ real-valued functions, there exists a probability measure that is supported on a subset of $n+1$ of the original $N$ atoms. We give a simple geometric characterization of barycenters via negative cones and derive a randomized algorithm that computes this new measure by "greedy geometric sampling" We then study its properties, and benchmark it on synthetic and real-world data to show that it can be very beneficial in the $Ngg n$ regime.
arXiv Detail & Related papers (2020-06-02T16:38:36Z)
Locally Private Hypothesis Selection [96.06118559817057]
We output a distribution from $mathcalQ$ whose total variation distance to $p$ is comparable to the best such distribution. We show that the constraint of local differential privacy incurs an exponential increase in cost. Our algorithms result in exponential improvements on the round complexity of previous methods.
arXiv Detail & Related papers (2020-02-21T18:30:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.