Data fission: splitting a single data point
- URL: http://arxiv.org/abs/2112.11079v9
- Date: Sun, 10 Dec 2023 19:04:32 GMT
- Title: Data fission: splitting a single data point
- Authors: James Leiner, Boyan Duan, Larry Wasserman, Aaditya Ramdas
- Abstract summary: In this paper, we offer a more general methodology for achieving such a split in finite samples.
We call our method data fission, as an alternative to data splitting, data carving and p-value masking.
We exemplify the method on a few applications, such as post-selection inference for trend filtering and other regression problems.
- Score: 27.500860533521713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Suppose we observe a random vector $X$ from some distribution $P$ in a known
family with unknown parameters. We ask the following question: when is it
possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part
is sufficient to reconstruct $X$ by itself, but both together can recover $X$
fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one
example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any
$m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and
g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2022) offer an alternative
approach that uses additive Gaussian noise -- this enables post-selection
inference in finite samples for Gaussian distributed data and asymptotically
when errors are non-Gaussian. In this paper, we offer a more general
methodology for achieving such a split in finite samples by borrowing ideas
from Bayesian inference to yield a (frequentist) solution that can be viewed as
a continuous analog of data splitting. We call our method data fission, as an
alternative to data splitting, data carving and p-value masking. We exemplify
the method on a few prototypical applications, such as post-selection inference
for trend filtering and other regression problems.
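For concreteness, here is a minimal sketch of the Gaussian version of this construction, in the spirit of the additive-noise idea mentioned in the abstract: assuming $X \sim N(\mu, \sigma^2)$ with $\sigma$ known, the function name and tuning parameter `tau` below are illustrative rather than the paper's notation.

```python
import numpy as np

def gaussian_fission(x, sigma, tau=1.0, rng=None):
    """Split X ~ N(mu, sigma^2) into two independent parts.

    Draw external noise Z ~ N(0, sigma^2) and set
        f(X) = X + tau * Z,   g(X) = X - Z / tau.
    Then Cov(f, g) = sigma^2 - sigma^2 = 0, and since (f, g) is jointly
    Gaussian the two parts are independent, while X is recovered via
        X = (f + tau^2 * g) / (1 + tau^2).
    """
    rng = np.random.default_rng(rng)
    z = sigma * rng.standard_normal(np.shape(x))
    return x + tau * z, x - z / tau

# Fission, then exact reconstruction of the original data.
rng = np.random.default_rng(0)
x = 2.0 + rng.standard_normal(5)                 # X_i ~ N(2, 1)
f, g = gaussian_fission(x, sigma=1.0, tau=0.5)
assert np.allclose((f + 0.5**2 * g) / (1 + 0.5**2), x)
```

Here `f` can drive selection (e.g., choosing a model) while `g` is reserved for inference, with `tau` trading off how much information each part retains.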
Related papers
- Dimension-free Private Mean Estimation for Anisotropic Distributions [55.86374912608193]
Previous private estimators on distributions over $\mathbb{R}^d$ suffer from a curse of dimensionality.
We present an algorithm whose sample complexity has improved dependence on dimension.
arXiv Detail & Related papers (2024-11-01T17:59:53Z)
- Statistical-Computational Trade-offs for Density Estimation [60.81548752871115]
We show that for a broad class of data structures their bounds cannot be significantly improved.
This is a novel statistical-computational trade-off for density estimation.
arXiv Detail & Related papers (2024-10-30T15:03:33Z)
- Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
Learning conditional distributions is a central problem in machine learning.
We propose a new learning paradigm that integrates both paired and unpaired data.
Our approach also connects intriguingly with inverse entropic optimal transport (OT).
arXiv Detail & Related papers (2024-10-03T16:12:59Z)
- Debiasing and a local analysis for population clustering using semidefinite programming [1.9761774213809036]
We consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of $2$ sub-Gaussian distributions.
This work is motivated by the application of clustering individuals according to their population of origin.
arXiv Detail & Related papers (2024-01-16T03:14:24Z)
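As a rough illustration of the SDP-based approach to two-way clustering in the entry above, the following is a generic semidefinite relaxation for a balanced two-component mixture. This is a standard textbook-style formulation rather than necessarily the program analyzed in the paper, and the use of cvxpy and an inner-product similarity matrix are our choices.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 20  # small sample, matching the paper's small-n regime

# Mixture of two (sub-)Gaussian components in R^2.
X = np.vstack([rng.normal(-2, 1, (n // 2, 2)),
               rng.normal(+2, 1, (n // 2, 2))])

# Similarity matrix: inner products of centered points.
Xc = X - X.mean(axis=0)
A = Xc @ Xc.T

# Relax max over u in {-1,+1}^n with sum(u) = 0 of u^T A u,
# via Z = u u^T: Z is PSD with unit diagonal and zero total sum.
Z = cp.Variable((n, n), PSD=True)
constraints = [cp.diag(Z) == 1, cp.sum(Z) == 0]
cp.Problem(cp.Maximize(cp.trace(A @ Z)), constraints).solve()

# Round: cluster by the sign of the leading eigenvector.
w, V = np.linalg.eigh(Z.value)
labels = np.sign(V[:, -1])
print(labels)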
- Testing with Non-identically Distributed Samples [20.74768558932617]
We examine the extent to which sublinear-sample property testing and estimation apply to settings where samples are independently but not identically distributed.
$\Theta(k/\varepsilon^2)$ samples from each distribution are sufficient to learn the average distribution $\textbf{p}_{\mathrm{avg}}$ to within error $\varepsilon$ in TV distance.
arXiv Detail & Related papers (2023-11-19T01:25:50Z)
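A quick simulation of the phenomenon summarized above, assuming the goal is to learn the average distribution from pooled samples; the constant in the sample size and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 50                     # support size
T = 200                    # number of distinct source distributions
eps = 0.1
m = int(4 * k / eps**2)    # samples drawn from EACH distribution

# T different distributions over {0, ..., k-1}.
P = rng.dirichlet(np.ones(k), size=T)
p_avg = P.mean(axis=0)

# Pool m samples from each p_t and take empirical frequencies.
counts = np.zeros(k)
for p in P:
    samples = rng.choice(k, size=m, p=p)
    counts += np.bincount(samples, minlength=k)
p_hat = counts / counts.sum()

tv = 0.5 * np.abs(p_hat - p_avg).sum()
print(f"TV(p_hat, p_avg) = {tv:.4f} (target eps = {eps})")
```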
- Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D, using the $k$-means++ implementation from sklearn as a black box, that validate our theoretical results.
arXiv Detail & Related papers (2023-02-20T23:29:43Z)
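The experiments mentioned above use sklearn's k-means++ as a black box on synthetic 2D data; a minimal sketch of that setup follows. The replicability wrapper itself is the paper's contribution and is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 2D mixture with three well-separated components.
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([c + rng.normal(0, 0.5, (100, 2)) for c in centers])

# k-means++ used purely as a black-box approximation routine; fixing
# random_state pins down the internal randomness, which is the kind of
# shared seed a replicable wrapper would coordinate on.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)
print(km.cluster_centers_)
```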
- Tight Bounds on the Hardness of Learning Simple Nonparametric Mixtures [9.053430799456587]
We study the problem of learning nonparametric distributions in a finite mixture.
We establish tight bounds on the sample complexity for learning the component distributions in such models.
arXiv Detail & Related papers (2022-03-28T23:53:48Z)
- Locally differentially private estimation of nonlinear functionals of discrete distributions [9.028773906859541]
We study the problem of estimating non-linear functionals of discrete distributions in the context of local differential privacy.
Only $\alpha$-locally differentially private (LDP) samples are publicly available, where the term 'local' means that each $z_i$ is produced using one individual $x_i$.
We describe the behavior of the quadratic risk for estimating the power sum functional $F_\gamma = \sum_{k=1}^{K} p_k^\gamma$, $\gamma > 0$, as a function of $K$ and $n$.
arXiv Detail & Related papers (2021-07-08T16:11:10Z)
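To make the objects above concrete, here is a naive plug-in estimate of the power sum functional under $K$-ary randomized response, a standard $\alpha$-LDP mechanism. This is a baseline sketch, not the estimator analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, alpha, gamma = 10, 100_000, 1.0, 2.0

p = rng.dirichlet(np.ones(K))            # true distribution over {0,...,K-1}
x = rng.choice(K, size=n, p=p)           # raw samples (never released)

# K-ary randomized response: report the truth with probability
# e^alpha / (e^alpha + K - 1), otherwise a uniformly random other symbol.
ea = np.exp(alpha)
keep = rng.random(n) < ea / (ea + K - 1)
z = np.where(keep, x, (x + rng.integers(1, K, size=n)) % K)

# Debias the privatized frequencies, then plug in:
# E[q_hat] = ((e^alpha - 1) * p + 1) / (e^alpha + K - 1).
q_hat = np.bincount(z, minlength=K) / n
p_hat = ((ea + K - 1) * q_hat - 1) / (ea - 1)
p_hat = np.clip(p_hat, 0, None)
p_hat /= p_hat.sum()

print("true F_gamma:", np.sum(p ** gamma))
print("plug-in from LDP samples:", np.sum(p_hat ** gamma))
```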
- Inductive Mutual Information Estimation: A Convex Maximum-Entropy Copula Approach [0.5330240017302619]
We propose a novel estimator of the mutual information between two ordinal vectors $x$ and $y$.
We prove that, so long as the constraint is feasible, this problem admits a unique solution, it is in the exponential family, and it can be learned by solving a convex optimization problem.
We show that our approach may be used to mitigate mode collapse in GANs by maximizing the entropy of the copula of fake samples.
arXiv Detail & Related papers (2021-02-25T21:21:40Z)
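The paper above solves a convex maximum-entropy problem over copulas; as a much simpler point of comparison, the sketch below shows only the rank-transform (empirical copula) ingredient combined with a crude Gaussian-copula formula for mutual information. This is a baseline of our own, not the paper's estimator.

```python
import numpy as np
from scipy.stats import rankdata, norm

def copula_mi_gaussian(x, y):
    """Crude copula-based MI estimate for two 1-D samples.

    Rank-transform to uniform pseudo-observations (the empirical
    copula), map through the normal quantile function, and apply the
    Gaussian-copula identity MI = -0.5 * log(1 - rho^2).
    """
    n = len(x)
    u = rankdata(x) / (n + 1)      # pseudo-observations in (0, 1)
    v = rankdata(y) / (n + 1)
    zx, zy = norm.ppf(u), norm.ppf(v)
    rho = np.corrcoef(zx, zy)[0, 1]
    return -0.5 * np.log1p(-rho ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = 0.8 * x + 0.6 * rng.standard_normal(10_000)   # correlation 0.8
print(copula_mi_gaussian(x, y))   # ~ -0.5 * log(1 - 0.64) = 0.51
```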
Given a discrete probability measure supported on $N$ atoms and a set of $n$ real-valued functions, there exists a probability measure, supported on a subset of at most $n+1$ of the original $N$ atoms, with the same integrals against all $n$ functions.
We give a simple geometric characterization of barycenters via negative cones and derive a randomized algorithm that computes this new measure by "greedy geometric sampling".
We then study its properties, and benchmark it on synthetic and real-world data to show that it can be very beneficial in the $N \gg n$ regime.
arXiv Detail & Related papers (2020-06-02T16:38:36Z)
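The existence statement above is Carathéodory-type: a vertex of the polytope of nonnegative weights matching the $n$ integrals (plus total mass one) has at most $n+1$ nonzero coordinates. The following LP sketch finds such a vertex; it is not the paper's greedy geometric sampling algorithm, and the solver choice is ours.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, n = 200, 5                     # N atoms, n test functions

atoms = rng.normal(size=N)
w = rng.dirichlet(np.ones(N))     # original measure: weights on the atoms

# n real-valued functions evaluated at the atoms (monomials here).
F = np.vstack([atoms ** j for j in range(1, n + 1)])
moments = F @ w                   # integrals to preserve

# Feasible set: v >= 0, F v = moments, sum(v) = 1. Any vertex of this
# polytope has at most n + 1 nonzero coordinates; a generic random
# objective encourages a simplex-type solver to return a vertex.
A_eq = np.vstack([F, np.ones(N)])
b_eq = np.append(moments, 1.0)
res = linprog(c=rng.normal(size=N), A_eq=A_eq, b_eq=b_eq,
              bounds=(0, None), method="highs")

support = np.flatnonzero(res.x > 1e-9)
print(len(support), "atoms kept (at most n + 1 =", n + 1, ")")
assert np.allclose(F @ res.x, moments)
```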
- Locally Private Hypothesis Selection [96.06118559817057]
We output a distribution from $\mathcal{Q}$ whose total variation distance to $p$ is comparable to that of the best such distribution.
We show that the constraint of local differential privacy incurs an exponential increase in cost.
Our algorithms result in exponential improvements on the round complexity of previous methods.
arXiv Detail & Related papers (2020-02-21T18:30:48Z)
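For background, the classical (non-private) Scheffé tournament that underlies hypothesis selection is easy to sketch; the paper's contribution is carrying this out under local differential privacy, which the sketch below does not attempt.

```python
import numpy as np

def scheffe_select(p_hat, Q):
    """Classical Scheffe tournament for hypothesis selection.

    Given an empirical distribution p_hat over a finite domain and a
    list Q of candidate distributions, each pair is compared on its
    Scheffe set; the candidate with the most wins has TV distance to
    the truth within a constant factor of the best candidate.
    """
    m = len(Q)
    wins = np.zeros(m)
    for i in range(m):
        for j in range(i + 1, m):
            A = Q[i] > Q[j]   # Scheffe set of the pair (i, j)
            if abs(Q[i][A].sum() - p_hat[A].sum()) <= abs(Q[j][A].sum() - p_hat[A].sum()):
                wins[i] += 1
            else:
                wins[j] += 1
    return int(np.argmax(wins))

rng = np.random.default_rng(0)
k = 20
Q = [rng.dirichlet(np.ones(k)) for _ in range(5)]
p = Q[2]                                   # truth is one of the candidates
x = rng.choice(k, size=5_000, p=p)
p_hat = np.bincount(x, minlength=k) / 5_000
print(scheffe_select(p_hat, Q))            # expected: 2
```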
This list is automatically generated from the titles and abstracts of the papers on this site.