Sparse Continuous Distributions and Fenchel-Young Losses
- URL: http://arxiv.org/abs/2108.01988v1
- Date: Wed, 4 Aug 2021 12:07:18 GMT
- Title: Sparse Continuous Distributions and Fenchel-Young Losses
- Authors: André F. T. Martins, Marcos Treviso, António Farinhas, Pedro M. Q. Aguiar, Mário A. T. Figueiredo, Mathieu Blondel and Vlad Niculae
- Abstract summary: We extend $\Omega$-regularized prediction maps and Fenchel-Young losses to arbitrary domains.
For quadratic energy functions in continuous domains, the resulting densities are $\beta$-Gaussians.
We demonstrate our sparse continuous distributions for attention-based audio classification and visual question answering.
- Score: 28.52737451408056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exponential families are widely used in machine learning; they include many
distributions in continuous and discrete domains (e.g., Gaussian, Dirichlet,
Poisson, and categorical distributions via the softmax transformation).
Distributions in each of these families have fixed support. In contrast, for
finite domains, there has been recent work on sparse alternatives to softmax
(e.g. sparsemax, $\alpha$-entmax, and fusedmax) and corresponding losses, which
have varying support.
This paper expands that line of work in several directions: first, it extends
$\Omega$-regularized prediction maps and Fenchel-Young losses to arbitrary
domains (possibly countably infinite or continuous). For linearly parametrized
families, we show that minimization of Fenchel-Young losses is equivalent to
moment matching of the statistics, generalizing a fundamental property of
exponential families. When $\Omega$ is a Tsallis negentropy with parameter
$\alpha$, we obtain "deformed exponential families," which include
$\alpha$-entmax and sparsemax ($\alpha = 2$) as particular cases. For quadratic
energy functions in continuous domains, the resulting densities are
$\beta$-Gaussians, an instance of elliptical distributions that contain as
particular cases the Gaussian, biweight, triweight and Epanechnikov densities,
and for which we derive closed-form expressions for the variance, Tsallis
entropy, and Fenchel-Young loss. When $\Omega$ is a total variation or Sobolev
regularizer, we obtain a continuous version of the fusedmax. Finally, we
introduce continuous-domain attention mechanisms, deriving efficient gradient
backpropagation algorithms for $\alpha \in \{1, 4/3, 3/2, 2\}$. Using them, we
demonstrate our sparse continuous distributions for attention-based audio
classification and visual question answering, showing that they allow attending
to time intervals and compact regions.
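As a concrete illustration of the finite-domain special case mentioned above, the following is a minimal NumPy sketch (not the authors' released code; function names are ours) of sparsemax, i.e. the $\Omega$-regularized prediction map with $\Omega(p)=\frac{1}{2}\|p\|^2$ ($\alpha$-entmax with $\alpha=2$), together with the generic Fenchel-Young loss $L_\Omega(z; q) = \Omega^*(z) + \Omega(q) - \langle z, q \rangle$.

```python
import numpy as np

def sparsemax(z):
    """Omega-regularized prediction map for Omega(p) = 0.5 * ||p||^2 over a finite
    domain (alpha-entmax with alpha = 2): argmax_{p in simplex} <p, z> - Omega(p),
    which equals [z - tau]_+ for a threshold tau making the output sum to 1."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                    # scores in descending order
    cssv = np.cumsum(z_sorted) - 1.0
    ks = np.arange(1, len(z) + 1)
    k = ks[z_sorted - cssv / ks > 0][-1]           # size of the support
    tau = cssv[k - 1] / k
    return np.maximum(z - tau, 0.0)

def fenchel_young_loss(z, q):
    """Generic Fenchel-Young loss L(z; q) = Omega*(z) + Omega(q) - <z, q>,
    here instantiated with Omega(p) = 0.5 * ||p||^2 (the sparsemax loss)."""
    p = sparsemax(z)                               # maximizer defining the conjugate
    omega = lambda u: 0.5 * np.dot(u, u)
    omega_star = np.dot(p, z) - omega(p)
    return omega_star + omega(q) - np.dot(z, q)

z = np.array([1.2, 0.3, -0.8, 2.0])
q = np.array([0.0, 0.0, 0.0, 1.0])                 # one-hot target
print(sparsemax(z))              # [0.1, 0.0, 0.0, 0.9]: some probabilities are exactly zero
print(fenchel_young_loss(z, q))  # nonnegative; zero iff sparsemax(z) == q
```

Note that replacing $\frac{1}{2}\|p\|^2$ by the negative Tsallis 2-entropy $\frac{1}{2}(\|p\|^2 - 1)$, as in the paper, only shifts $\Omega$ by a constant, which cancels in the loss.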
Related papers
- Generalized Regret Analysis of Thompson Sampling using Fractional
Posteriors [12.43000662545423]
Thompson sampling (TS) is one of the most popular and earliest algorithms to solve multi-armed bandit problems.
We consider a variant of TS, named $\alpha$-TS, where we use a fractional or $\alpha$-posterior instead of the standard posterior distribution.
arXiv Detail & Related papers (2023-09-12T16:15:33Z) - Nearly $d$-Linear Convergence Bounds for Diffusion Models via Stochastic
Localization [40.808942894229325]
We provide the first convergence bounds which are linear in the data dimension.
We show that diffusion models require at most $\tilde O\!\left(\frac{d \log^2(1/\delta)}{\varepsilon^2}\right)$ steps to approximate an arbitrary distribution.
arXiv Detail & Related papers (2023-08-07T16:01:14Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C\,\epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a Lipschitzness lower bound of $\Omega(\sqrt{n/p})$ for neural networks with $p$ parameters that interpolate $n$ samples from an arbitrary distribution.
We then show the potential benefit of overparametrization for smooth data when $n=\mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n=\exp(\omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z) - Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use reproducing kernel Hilbert spaces to estimate the value function of an infinite-horizon discounted Markov reward process.
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
arXiv Detail & Related papers (2021-09-24T14:48:20Z) - Sparse and Continuous Attention Mechanisms [14.941013982958209]
We introduce continuous-domain attention mechanisms, deriving efficient gradient backpropagation algorithms for $\alpha \in \{1, 2\}$ (a 1D sketch of the $\alpha = 2$ case appears after this list).
Experiments on attention-based text classification, machine translation, and visual question answering illustrate the use of continuous attention in 1D and 2D.
arXiv Detail & Related papers (2020-06-12T14:16:48Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z) - On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and
Non-Asymptotic Concentration [115.1954841020189]
We study the asymptotic and non-asymptotic properties of stochastic approximation procedures with Polyak-Ruppert averaging.
We prove a central limit theorem (CLT) for the averaged iterates with fixed step size and number of iterations going to infinity.
arXiv Detail & Related papers (2020-04-09T17:54:18Z) - The Bethe and Sinkhorn Permanents of Low Rank Matrices and Implications
for Profile Maximum Likelihood [33.51964370430905]
We provide new bounds on the quality of approximation of the Bethe and Sinkhorn permanents.
We establish a surprising connection between the convex relaxation in prior work and the well-studied Bethe and Sinkhorn approximations.
arXiv Detail & Related papers (2020-04-06T06:40:03Z) - Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions [84.49087114959872]
We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions.
In particular, we study Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions.
arXiv Detail & Related papers (2020-02-10T23:23:04Z)
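As referenced in the "Sparse and Continuous Attention Mechanisms" entry above, the following is a minimal 1D sketch of the $\alpha = 2$ case: for a quadratic score $f(t) = -(t-\mu)^2/(2\sigma^2)$, the 2-entmax density is a $\beta$-Gaussian with compact support (a truncated parabola), and attention reduces to a weighted average of the values over that support. The sketch integrates on a grid instead of using the closed-form expressions and backpropagation rules derived in the papers; all names are illustrative.

```python
import numpy as np

def truncated_parabola(mu, sigma_sq):
    """1D beta-Gaussian for alpha = 2: the 2-entmax transform of the quadratic score
    f(t) = -(t - mu)^2 / (2 * sigma_sq), i.e. p(t) = [c - (t - mu)^2 / (2 * sigma_sq)]_+
    with compact support [mu - a, mu + a]."""
    a = (1.5 * sigma_sq) ** (1.0 / 3.0)   # support radius chosen so that p integrates to 1
    c = 3.0 / (4.0 * a)                   # density height at the mode
    def p(t):
        return np.maximum(c - (t - mu) ** 2 / (2.0 * sigma_sq), 0.0)
    return p, (mu - a, mu + a)

def continuous_attention(value_fn, mu, sigma_sq, lo=0.0, hi=1.0, n=2000):
    """Continuous attention context: integral of p(t) * v(t) dt, approximated on a grid."""
    p, _ = truncated_parabola(mu, sigma_sq)
    t = np.linspace(lo, hi, n)
    dt = t[1] - t[0]
    w = p(t)
    w = w / (w.sum() * dt)                     # renormalize after clipping to [lo, hi]
    v = np.stack([value_fn(ti) for ti in t])   # (n, d) values along the domain
    return (w[:, None] * v).sum(axis=0) * dt   # weighted average of the values

# toy usage: attend to a compact time interval of a scalar signal on [0, 1]
ctx = continuous_attention(lambda t: np.array([np.sin(2 * np.pi * t)]), mu=0.4, sigma_sq=0.01)
print(ctx)   # only the interval around t = 0.4 receives nonzero attention weight
```

Because the density has bounded support, the context vector depends only on the values inside that interval, which is what lets the attention mechanisms in the main paper attend to time intervals and compact regions.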
This list is automatically generated from the titles and abstracts of the papers indexed on this site. The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.