Linear Time Sinkhorn Divergences using Positive Features
- URL: http://arxiv.org/abs/2006.07057v3
- Date: Mon, 26 Oct 2020 07:55:17 GMT
- Title: Linear Time Sinkhorn Divergences using Positive Features
- Authors: Meyer Scetbon and Marco Cuturi
- Abstract summary: Solving optimal transport with an entropic regularization requires computing an $n\times n$ kernel matrix that is repeatedly applied to a vector.
We propose to use instead ground costs of the form $c(x,y)=-\log\dotp{\varphi(x)}{\varphi(y)}$ where $\varphi$ is a map from the ground space onto the positive orthant $\RR^r_+$, with $r\ll n$.
- Score: 51.50788603386766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although Sinkhorn divergences are now routinely used in data sciences to
compare probability distributions, the computational effort required to compute
them remains expensive, growing in general quadratically in the size $n$ of the
support of these distributions. Indeed, solving optimal transport (OT) with an
entropic regularization requires computing an $n\times n$ kernel matrix (the
neg-exponential of an $n\times n$ pairwise ground cost matrix) that is
repeatedly applied to a vector. We propose to use instead ground costs of the
form $c(x,y)=-\log\dotp{\varphi(x)}{\varphi(y)}$ where $\varphi$ is a map from
the ground space onto the positive orthant $\RR^r_+$, with $r\ll n$. This
choice yields, equivalently, a kernel $k(x,y)=\dotp{\varphi(x)}{\varphi(y)}$,
and ensures that the cost of Sinkhorn iterations scales as $O(nr)$. We show
that usual cost functions can be approximated using this form. Additionally, we
take advantage of the fact that our approach yields approximations that remain
fully differentiable with respect to input distributions, as opposed to
previously proposed adaptive low-rank approximations of the kernel matrix, to
train a faster variant of OT-GAN \cite{salimans2018improving}.
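To make the $O(nr)$ per-iteration cost concrete, here is a minimal NumPy sketch of Sinkhorn iterations with a factorized kernel; it is an illustration under stated assumptions, not the authors' reference implementation. The feature map below is the standard positive random-feature approximation of the Gibbs kernel $e^{-\|x-y\|^2/(2\sigma^2)}$ of the squared Euclidean cost; the function names, $r$, and $\sigma$ are choices made for the example.
```python
import numpy as np

def positive_features(X, W, sigma=1.0):
    # Positive random features: phi(x)_i = exp(w_i . x / sigma - |x|^2 / sigma^2) / sqrt(r),
    # so that <phi(x), phi(y)> is a Monte Carlo estimate of exp(-|x - y|^2 / (2 sigma^2)).
    # W has shape (r, d) with rows drawn i.i.d. from N(0, I_d); all entries of phi are > 0.
    r = W.shape[0]
    U = X / sigma
    logits = U @ W.T - np.sum(U ** 2, axis=1, keepdims=True)
    return np.exp(logits) / np.sqrt(r)                     # shape (n, r)

def sinkhorn_positive_features(a, b, phi_x, phi_y, n_iter=200):
    # Sinkhorn iterations where the kernel is only ever applied through its
    # factorization K = phi_x @ phi_y.T, so each product costs O(n r), not O(n^2).
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (phi_x @ (phi_y.T @ v))                    # u = a / (K v)
        v = b / (phi_y @ (phi_x.T @ u))                    # v = b / (K^T u)
    return u, v                                            # scalings of the plan diag(u) K diag(v)

# Usage: two point clouds of size n with uniform weights.
rng = np.random.default_rng(0)
n, d, r, sigma = 2000, 2, 100, 1.0
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d)) + 1.0
a = np.full(n, 1.0 / n)
b = np.full(n, 1.0 / n)
W = rng.normal(size=(r, d))
u, v = sinkhorn_positive_features(a, b, positive_features(X, W, sigma),
                                  positive_features(Y, W, sigma))
```
The point of the sketch is only that the $n\times n$ kernel is never materialized, so memory and per-iteration time stay linear in $n$; a practical implementation would also monitor marginal errors and stabilize the iterations.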
Related papers
- Upper Bounds for Learning in Reproducing Kernel Hilbert Spaces for Orbits of an Iterated Function System [1.1510009152620668]
A key problem in learning theory is to compute a function $f$ that closely approximates the relationship between some input $x$ and corresponding output $y$.
This approximation is based on sample points $(x_t,y_t)_{t=1}^{m}$, where the function $f$ can be approximated within reproducing kernel Hilbert spaces.
arXiv Detail & Related papers (2024-10-10T20:34:22Z) - Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - A General Theory for Kernel Packets: from state space model to compactly supported basis [16.235214685688227]
We prove that an $m$-dimensional SS model formulation of a GP is equivalent to a concept we introduce as the general right Kernel Packet (KP).
KPs improve GP prediction time to $\mathcal{O}(\log n)$ or $\mathcal{O}(1)$, enable broader applications including GP derivatives and kernel multiplications, and can be generalized to multi-dimensional additive and product kernels for scattered data.
arXiv Detail & Related papers (2024-02-06T14:12:46Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ achieving $F(\mathbf{w}) = C\,\mathrm{OPT} + \epsilon$ with high probability, where $\mathrm{OPT}$ is the optimal attainable loss.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a Lipschitzness lower bound of $\Omega(\sqrt{n/p})$ for the robustness of interpolating neural network parameters on arbitrary distributions.
We then show the potential benefit of overparametrization for smooth data when $n=\mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n=\exp(\omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random technique, called Ternary Random Feature (TRF); a minimal illustrative sketch is given after this list.
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Spectral properties of sample covariance matrices arising from random
matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $\mathrm{tr}(AR(z))$, for $R(z) = (\frac{1}{n}XX^T - zI_p)^{-1}$ and $A\in \mathcal{M}_p$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt{n})$.
Here, we additionally bound $\|\mathbb{E}[R(z)] - \tilde{R}(z)\|_F$.
arXiv Detail & Related papers (2021-09-06T14:21:43Z) - Kernel Thinning [26.25415159542831]
Kernel thinning is a new procedure for compressing a distribution $\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning.
We derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Matérn, and B-spline kernels.
arXiv Detail & Related papers (2021-05-12T17:56:42Z) - Convergence of Sparse Variational Inference in Gaussian Processes
Regression [29.636483122130027]
We show that a method with an overall computational cost of $\mathcal{O}(N(\log N)^{2D}(\log\log N)^2)$ can be used to perform inference.
arXiv Detail & Related papers (2020-08-01T19:23:34Z)
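As a complement to the Ternary Random Feature (TRF) entry above, the following is a minimal, hypothetical sketch of ternary random features: the projection matrix has i.i.d. entries in $\{-1, 0, +1\}$, so in principle the projection needs only additions and subtractions, and the matrix can be stored with about two bits per entry. The sparsity level and the ReLU nonlinearity are illustrative assumptions, not necessarily the paper's exact settings.
```python
import numpy as np

def ternary_random_features(X, r, zero_prob=0.5, rng=None):
    # Ternary projection: each entry of W is 0 with probability zero_prob, otherwise
    # +1 or -1 with equal probability, so X @ W.T needs no real multiplications and
    # W takes far fewer bits to store than a dense Gaussian matrix.
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    p = (1.0 - zero_prob) / 2.0
    W = rng.choice([-1.0, 0.0, 1.0], size=(r, d), p=[p, zero_prob, p])
    scale = np.sqrt(d * (1.0 - zero_prob))      # keeps projections roughly unit-variance
    return np.maximum(X @ W.T / scale, 0.0)     # ReLU features, shape (n, r)
```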
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.