How many moments does MMD compare?
- URL: http://arxiv.org/abs/2106.14277v1
- Date: Sun, 27 Jun 2021 16:44:17 GMT
- Title: How many moments does MMD compare?
- Authors: Rustem Takhanov
- Abstract summary: mathcalF-1$ acts on smooth functions in the same way as an integral operator associated with $K$.
We show that kernels defined by pseudo-differential operators are able to approximate uniformly any continuous Mercer kernel on a compact set.
- Score: 7.919213739992465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a new way of study of Mercer kernels, by corresponding to a
special kernel $K$ a pseudo-differential operator $p({\mathbf x}, D)$ such that
$\mathcal{F} p({\mathbf x}, D)^\dag p({\mathbf x}, D) \mathcal{F}^{-1}$ acts on
smooth functions in the same way as an integral operator associated with $K$
(where $\mathcal{F}$ is the Fourier transform). We show that kernels defined by
pseudo-differential operators are able to approximate uniformly any continuous
Mercer kernel on a compact set.
The symbol $p({\mathbf x}, {\mathbf y})$ encapsulates a lot of useful
information about the structure of the Maximum Mean Discrepancy distance
defined by the kernel $K$. We approximate $p({\mathbf x}, {\mathbf y})$ with
the sum of the first $r$ terms of the Singular Value Decomposition of $p$,
denoted by $p_r({\mathbf x}, {\mathbf y})$. If ordered singular values of the
integral operator associated with $p({\mathbf x}, {\mathbf y})$ die down
rapidly, the MMD distance defined by the new symbol $p_r$ differs from the
initial one only slightly. Moreover, the new MMD distance can be interpreted as
an aggregated result of comparing $r$ local moments of two probability
distributions.
The latter results holds under the condition that right singular vectors of
the integral operator associated with $p$ are uniformly bounded. But even if
this is not satisfied we can still hold that the Hilbert-Schmidt distance
between $p$ and $p_r$ vanishes. Thus, we report an interesting phenomenon: the
MMD distance measures the difference of two probability distributions with
respect to a certain number of local moments, $r^\ast$, and this number
$r^\ast$ depends on the speed with which singular values of $p$ die down.
Related papers
- Near-Optimal and Tractable Estimation under Shift-Invariance [0.21756081703275998]
Class of all such signals is but extremely rich: it contains all exponential oscillations over $mathbbCn$ with total degree $s$.
We show that the statistical complexity of this class, as measured by the radius squared minimax frequencies of the $(delta)$-confidence $ell$-ball, is nearly the same as for the class of $s$-sparse signals, namely $Oleft(slog(en) + log(delta-1)right) cdot log(en/s)
arXiv Detail & Related papers (2024-11-05T18:11:23Z) - Provably learning a multi-head attention layer [55.2904547651831]
Multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $mathbfx*$ rather than for all $mathbfx*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Statistical Learning under Heterogeneous Distribution Shift [71.8393170225794]
Ground-truth predictor is additive $mathbbE[mathbfz mid mathbfx,mathbfy] = f_star(mathbfx) +g_star(mathbfy)$.
arXiv Detail & Related papers (2023-02-27T16:34:21Z) - Strong uniform convergence of Laplacians of random geometric and
directed kNN graphs on compact manifolds [0.0]
We study the almost sure uniform convergence of this operator to the diffusive Laplace-Beltrami operator when $n$ tends to infinity.
This work extends known results of the past 15 years.
arXiv Detail & Related papers (2022-12-20T14:31:06Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $mathbfxmapstosigma(mathbfwcdotmathbfx)$ for monotone activations.
The goal of the learner is to output a hypothesis vector $mathbfw$ that $F(mathbbw)=C, epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Local approximation of operators [0.0]
We study the problem of determining the degree of approximation of a non-linear operator between metric spaces $mathfrakX$ and $mathfrakY$.
We establish constructive methods to do this efficiently, i.e., with the constants involved in the estimates on the approximation on $mathbbSd$ being $mathcalO(d1/6)$.
arXiv Detail & Related papers (2022-02-13T19:28:34Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $bf K$ is independent of the distribution of the i.i.d. entries of $bf w$.
We propose a novel random technique, called Ternary Random Feature (TRF)
The computation of the proposed random features requires no multiplication and a factor of $b$ less bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Spectral properties of sample covariance matrices arising from random
matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $texttr(AR(z))$, for $R(z) = (frac1nXXT- zI_p)-1$ and $Ain mathcal M_p$ deterministic, have a standard deviation of order $O(|A|_* / sqrt n)$.
Here, we show that $|mathbb E[R(z)] - tilde R(z)|_F
arXiv Detail & Related papers (2021-09-06T14:21:43Z) - Kernel Thinning [26.25415159542831]
kernel thinning is a new procedure for compressing a distribution $mathbbP$ more effectively than i.i.d. sampling or standard thinning.
We derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Mat'ern, and B-spline kernels.
arXiv Detail & Related papers (2021-05-12T17:56:42Z) - From Smooth Wasserstein Distance to Dual Sobolev Norm: Empirical
Approximation and Statistical Applications [18.618590805279187]
We show that $mathsfW_p(sigma)$ is controlled by a $pth order smooth dual Sobolev $mathsfd_p(sigma)$.
We derive the limit distribution of $sqrtnmathsfd_p(sigma)(hatmu_n,mu)$ in all dimensions $d$, when $mu$ is sub-Gaussian.
arXiv Detail & Related papers (2021-01-11T17:23:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.