Moreau-Yosida $f$-divergences
- URL: http://arxiv.org/abs/2102.13416v1
- Date: Fri, 26 Feb 2021 11:46:10 GMT
- Title: Moreau-Yosida $f$-divergences
- Authors: Dávid Terjék
- Abstract summary: Variational representations of $f$-divergences are central to many machine learning algorithms.
We generalize the so-called tight variational representation of $f$-divergences in the case of probability measures on compact metric spaces.
We provide an implementation of the variational formulas for the Kullback-Leibler, reverse Kullback-Leibler, $\chi^2$, reverse $\chi^2$, squared Hellinger, Jensen-Shannon, Jeffreys, triangular discrimination and total variation divergences.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Variational representations of $f$-divergences are central to many machine
learning algorithms, with Lipschitz constrained variants recently gaining
attention. Inspired by this, we generalize the so-called tight variational
representation of $f$-divergences in the case of probability measures on
compact metric spaces to be taken over the space of Lipschitz functions
vanishing at an arbitrary base point, characterize functions achieving the
supremum in the variational representation, propose a practical algorithm to
calculate the tight convex conjugate of $f$-divergences compatible with
automatic differentiation frameworks, define the Moreau-Yosida approximation of
$f$-divergences with respect to the Wasserstein-$1$ metric, and derive the
corresponding variational formulas, providing a generalization of a number of
recent results, novel special cases of interest and a relaxation of the hard
Lipschitz constraint. As an application of our theoretical results, we propose
the Moreau-Yosida $f$-GAN, providing an implementation of the variational
formulas for the Kullback-Leibler, reverse Kullback-Leibler, $\chi^2$, reverse
$\chi^2$, squared Hellinger, Jensen-Shannon, Jeffreys, triangular
discrimination and total variation divergences as GANs trained on CIFAR-10,
leading to competitive results and a simple solution to the problem of
uniqueness of the optimal critic.
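As background (these are standard formulas, not quoted verbatim from the paper), the variational representation being generalized and the Moreau-Yosida envelope can be sketched as follows. For a convex generator $f$ with convex conjugate $f^*(y) = \sup_x \{xy - f(x)\}$,
$$
D_f(P \| Q) \;=\; \sup_{g} \; \mathbb{E}_{x \sim P}[g(x)] - \mathbb{E}_{x \sim Q}[f^*(g(x))],
$$
and the paper's tight version takes the supremum over Lipschitz functions $g$ vanishing at a base point, with $f^*$ replaced by a tight conjugate computed numerically. The Moreau-Yosida envelope of a function $F$ on a metric space $(X, d)$ at scale $\lambda > 0$ is $F_\lambda(x) = \inf_{y \in X} \{ F(y) + \tfrac{1}{2\lambda} d(x, y)^2 \}$; the paper defines an analogous smoothing of $D_f$ with respect to the Wasserstein-$1$ metric (see the paper for the exact scaling), which relaxes the hard Lipschitz constraint on the critic.

Below is a minimal, hedged sketch of the classical (non-tight) variational objective for the Kullback-Leibler case, using the textbook conjugate $f^*(y) = e^{y-1}$ of $f(x) = x \log x$; it only illustrates the autodiff-compatible structure of such estimators and is not the paper's Moreau-Yosida $f$-GAN algorithm.

```python
# Hedged sketch: classical f-GAN-style variational lower bound on KL(P || Q).
# Not the paper's tight-conjugate / Moreau-Yosida algorithm; names are illustrative.
import torch

def kl_conjugate(y: torch.Tensor) -> torch.Tensor:
    """Convex conjugate of f(x) = x*log(x): f*(y) = exp(y - 1)."""
    return torch.exp(y - 1.0)

def kl_variational_bound(critic: torch.nn.Module,
                         x_p: torch.Tensor,
                         x_q: torch.Tensor) -> torch.Tensor:
    """E_P[g(x)] - E_Q[f*(g(x))], a lower bound on KL(P || Q) for any critic g."""
    return critic(x_p).mean() - kl_conjugate(critic(x_q)).mean()

if __name__ == "__main__":
    # Toy example: estimate KL between two Gaussians by maximizing the bound.
    critic = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(),
                                 torch.nn.Linear(64, 1))
    opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
    x_p = torch.randn(4096, 2) + 1.0  # samples from P
    x_q = torch.randn(4096, 2)        # samples from Q
    for _ in range(200):
        loss = -kl_variational_bound(critic, x_p, x_q)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(float(kl_variational_bound(critic, x_p, x_q)))
```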
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametrized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
- Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for combinatorial multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z)
- Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z)
- Universal Online Learning with Gradient Variations: A Multi-layer Online Ensemble Approach [57.92727189589498]
We propose an online convex optimization approach with two different levels of adaptivity.
We obtain $\mathcal{O}(\log V_T)$, $\mathcal{O}(d \log V_T)$ and $\hat{\mathcal{O}}(\sqrt{V_T})$ regret bounds for strongly convex, exp-concave and convex loss functions.
arXiv Detail & Related papers (2023-07-17T09:55:35Z)
- Quantum Rényi and $f$-divergences from integral representations [11.74020933567308]
Smooth Csiszár $f$-divergences can be expressed as integrals over so-called hockey stick divergences.
We find that the Rényi divergences defined via our new quantum $f$-divergences are not additive in general.
We derive various inequalities, including new reverse Pinsker inequalities with applications in differential privacy.
arXiv Detail & Related papers (2023-06-21T15:39:38Z)
- Multi-block-Single-probe Variance Reduced Estimator for Coupled Compositional Optimization [49.58290066287418]
We propose a novel Multi-block-Single-probe Variance Reduced (MSVR) estimator to alleviate the complexity of compositional problems.
Our results improve upon prior ones in several aspects, including the order of sample complexities and the dependence on the strong convexity parameter.
arXiv Detail & Related papers (2022-07-18T12:03:26Z)
- $\alpha$-GAN: Convergence and Estimation Guarantees [7.493779672689531]
We prove a correspondence between the min-max optimization of general CPE loss function GANs and the minimization of associated $f$-divergences.
We then focus on $\alpha$-GAN, defined via the $\alpha$-loss, which interpolates several GANs and corresponds to the minimization of the Arimoto divergence.
arXiv Detail & Related papers (2022-05-12T23:26:51Z)
- Multiscale regression on unknown manifolds [13.752772802705978]
We construct low-dimensional coordinates on $\mathcal{M}$ at multiple scales and perform multiscale regression by local fitting.
We analyze the generalization error of our method by proving finite sample bounds in high probability on rich classes of priors.
Our algorithm has quasilinear complexity in the sample size, with constants linear in $D$ and exponential in $d$.
arXiv Detail & Related papers (2021-01-13T15:14:31Z)
- Adaptive Online Estimation of Piecewise Polynomial Trends [23.91519151164528]
We consider the framework of non-stationary optimization with squared error losses and noisy gradient feedback.
Motivated from the theory of non-parametric regression, we introduce a new variational constraint.
We show that the same policy is minimax optimal for several other non-parametric families of interest.
arXiv Detail & Related papers (2020-09-30T19:30:28Z)
- Variational Representations and Neural Network Estimation of Rényi Divergences [4.2896536463351]
We derive a new variational formula for the Rényi family of divergences, $R_\alpha(Q\|P)$, between probability measures $Q$ and $P$.
By applying this theory to neural-network estimators, we show that if a neural network family satisfies one of several strengthened versions of the universal approximation property, then the corresponding Rényi divergence estimator is consistent.
arXiv Detail & Related papers (2020-07-07T22:34:30Z)
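For reference (a standard definition whose normalization conventions vary between papers, not quoted from the abstract above), the classical Rényi divergence of order $\alpha \in (0, 1) \cup (1, \infty)$ is
$$
R_\alpha(Q \| P) \;=\; \frac{1}{\alpha - 1} \log \int \Big(\frac{dQ}{dP}\Big)^{\alpha} \, dP,
$$
which recovers the Kullback-Leibler divergence $\mathrm{KL}(Q \| P)$ in the limit $\alpha \to 1$.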
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of the information presented here and is not responsible for any consequences arising from its use.