Kernel Thinning
- URL: http://arxiv.org/abs/2105.05842v9
- Date: Wed, 7 Jun 2023 00:41:12 GMT
- Title: Kernel Thinning
- Authors: Raaz Dwivedi, Lester Mackey
- Abstract summary: Kernel thinning is a new procedure for compressing a distribution $\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning.
We derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Matérn, and B-spline kernels.
- Score: 26.25415159542831
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce kernel thinning, a new procedure for compressing a distribution
$\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning. Given
a suitable reproducing kernel $\mathbf{k}_{\star}$ and $\mathcal{O}(n^2)$ time,
kernel thinning compresses an $n$-point approximation to $\mathbb{P}$ into a
$\sqrt{n}$-point approximation with comparable worst-case integration error
across the associated reproducing kernel Hilbert space. The maximum discrepancy
in integration error is $\mathcal{O}_d(n^{-1/2}\sqrt{\log n})$ in probability
for compactly supported $\mathbb{P}$ and $\mathcal{O}_d(n^{-\frac{1}{2}} (\log
n)^{(d+1)/2}\sqrt{\log\log n})$ for sub-exponential $\mathbb{P}$ on
$\mathbb{R}^d$. In contrast, an equal-sized i.i.d. sample from $\mathbb{P}$
suffers $\Omega(n^{-1/4})$ integration error. Our sub-exponential guarantees
resemble the classical quasi-Monte Carlo error rates for uniform $\mathbb{P}$
on $[0,1]^d$ but apply to general distributions on $\mathbb{R}^d$ and a wide
range of common kernels. Moreover, the same construction delivers near-optimal
$L^\infty$ coresets in $\mathcal O(n^2)$ time. We use our results to derive
explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Mat\'ern,
and B-spline kernels and present two vignettes illustrating the practical
benefits of kernel thinning over i.i.d. sampling and standard Markov chain
Monte Carlo thinning, in dimensions $d=2$ through $100$.
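To make the comparison above concrete, here is a minimal Python sketch, under illustrative assumptions (Gaussian kernel, unit bandwidth, a standard normal sample, $n=1024$, $d=2$), that measures the squared maximum mean discrepancy between an $n$-point sample and two equal-budget $\sqrt{n}$-point coresets obtained by standard thinning and by i.i.d. subsampling. It is not the authors' kernel thinning algorithm; it only illustrates how the error metric from the abstract can be evaluated.

```python
# Toy comparison, not the authors' kernel thinning procedure: estimate the
# (biased, V-statistic) squared MMD between an n-point sample and an
# equal-budget sqrt(n)-point coreset under a Gaussian kernel.
import numpy as np


def gaussian_kernel(X, Y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)), evaluated pairwise."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mmd_squared(X, Y, bandwidth=1.0):
    """Squared maximum mean discrepancy between the empirical measures of X and Y."""
    return (
        gaussian_kernel(X, X, bandwidth).mean()
        + gaussian_kernel(Y, Y, bandwidth).mean()
        - 2.0 * gaussian_kernel(X, Y, bandwidth).mean()
    )


rng = np.random.default_rng(0)
n, d = 1024, 2                              # illustrative sample size and dimension
X = rng.standard_normal((n, d))             # stands in for an n-point approximation of P
m = int(np.sqrt(n))                         # sqrt(n)-point compression budget

standard_thin = X[:: n // m]                # "standard thinning": keep every (n/m)-th point
iid_thin = X[rng.choice(n, size=m, replace=False)]  # i.i.d.-style subsample

print("MMD^2, standard thinning :", mmd_squared(X, standard_thin))
print("MMD^2, i.i.d. subsample  :", mmd_squared(X, iid_thin))
```

Kernel thinning would instead select the $\sqrt{n}$ points non-uniformly so that this discrepancy decays at the $\mathcal{O}_d(n^{-1/2}\sqrt{\log n})$ rate quoted above, rather than the $\Omega(n^{-1/4})$ rate suffered by equal-sized i.i.d. samples.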
Related papers
- Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components that set the transformer architecture apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that, in the worst case, exponential dependence on the number of heads $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Classical shadows of fermions with particle number symmetry [0.0]
We provide an estimator for any $k$-RDM with $\mathcal{O}(k^2\eta)$ classical complexity.
Our method, in the worst case of half-filling, still provides a factor of $4^k$ advantage in sample complexity.
arXiv Detail & Related papers (2022-08-18T17:11:12Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C\,\epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Feature (TRF).
Computing the proposed random features requires no multiplication and a factor of $b$ fewer bits of storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Threshold Phenomena in Learning Halfspaces with Massart Noise [56.01192577666607]
We study the problem of PAC learning halfspaces on $\mathbb{R}^d$ with Massart noise under Gaussian marginals.
Our results qualitatively characterize the complexity of learning halfspaces in the Massart model.
arXiv Detail & Related papers (2021-08-19T16:16:48Z) - Linear Time Sinkhorn Divergences using Positive Features [51.50788603386766]
Solving optimal transport with an entropic regularization requires computing an $n \times n$ kernel matrix that is repeatedly applied to a vector.
We propose to use instead ground costs of the form $c(x,y) = -\log\langle\varphi(x), \varphi(y)\rangle$, where $\varphi$ is a map from the ground space onto the positive orthant $\mathbb{R}^r_+$, with $r \ll n$.
arXiv Detail & Related papers (2020-06-12T10:21:40Z)
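As an aside on the last related item, the following sketch (with hypothetical positive feature maps and unit entropic regularization as assumptions) shows why the ground cost $c(x,y) = -\log\langle\varphi(x),\varphi(y)\rangle$ yields linear-time kernel applications: the Gibbs kernel factors as $K = \Phi_X \Phi_Y^\top$, so each matrix-vector product in a Sinkhorn iteration costs $\mathcal{O}(nr)$ rather than $\mathcal{O}(n^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 2000, 16                      # n points, r-dimensional positive features (r << n)
Phi_X = rng.random((n, r)) + 1e-3    # hypothetical phi(x_i), rows in the positive orthant
Phi_Y = rng.random((n, r)) + 1e-3    # hypothetical phi(y_j)
v = rng.random(n)

# With c(x, y) = -log<phi(x), phi(y)> and regularization epsilon = 1, the Gibbs
# kernel is K_ij = exp(-c(x_i, y_j)) = <phi(x_i), phi(y_j)>, i.e. K = Phi_X @ Phi_Y.T.
Kv_fast = Phi_X @ (Phi_Y.T @ v)      # O(n * r): never forms the n x n matrix
Kv_slow = (Phi_X @ Phi_Y.T) @ v      # O(n^2): dense reference, for checking only
assert np.allclose(Kv_fast, Kv_slow)
```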