How Smooth Is Attention?
- URL: http://arxiv.org/abs/2312.14820v2
- Date: Tue, 4 Jun 2024 15:51:36 GMT
- Title: How Smooth Is Attention?
- Authors: Valérie Castin, Pierre Ablin, Gabriel Peyré
- Abstract summary: We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios.
We show that for inputs of length $n$ in any compact set, the Lipschitz constant of self-attention is bounded by $\sqrt{n}$ up to a constant factor.
Our mean-field framework for masked self-attention is novel and of independent interest.
- Score: 26.322030088685928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties - which are key when it comes to analyzing robustness and expressive power - is incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length $n$ and layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we show that for inputs of length $n$ in any compact set, the Lipschitz constant of self-attention is bounded by $\sqrt{n}$ up to a constant factor and that this bound is tight for reasonable sequence lengths. When the sequence length $n$ is too large for the previous bound to be tight, which we refer to as the mean-field regime, we provide an upper bound and a matching lower bound which are independent of $n$. Our mean-field framework for masked self-attention is novel and of independent interest. Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings.
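As a quick numerical companion to the $\sqrt{n}$ statement above, the sketch below (a minimal illustration, not the authors' code) measures the spectral norm of the Jacobian of a randomly initialized single-head self-attention map at a random input for a few sequence lengths $n$. The head dimension $d$, the Gaussian inputs, and the $1/\sqrt{d}$ scaling are illustrative assumptions, and the Jacobian norm at a single point only probes the local Lipschitz constant there, not the supremum over a compact set.
```python
# Hedged sketch: local Lipschitz constant of single-head self-attention,
# estimated as the spectral norm of its Jacobian at one random input.
import torch

torch.manual_seed(0)
d = 8                                                # head dimension (arbitrary)
W_Q, W_K, W_V = (torch.randn(d, d) / d**0.5 for _ in range(3))

def self_attention(X):                               # X: (n, d) -> (n, d)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = torch.softmax(Q @ K.T / d**0.5, dim=-1)      # unmasked softmax attention
    return A @ V

for n in (4, 16, 64, 128):
    X = torch.randn(n, d)                            # one input of length n
    J = torch.autograd.functional.jacobian(self_attention, X)
    J = J.reshape(n * d, n * d)                      # flatten to an (nd, nd) matrix
    lip = torch.linalg.matrix_norm(J, ord=2)         # local Lipschitz constant at X
    print(f"n = {n:3d}   local Lipschitz estimate = {lip.item():.3f}")
```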
Related papers
- Continuous K-Max Bandits [54.21533414838677]
We study the $K$-Max multi-armed bandits problem with continuous outcome distributions and weak value-index feedback.
This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc.
Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds.
arXiv Detail & Related papers (2025-02-19T06:37:37Z)
- Stick-breaking Attention [38.492552119793]
The self-attention mechanism traditionally relies on the softmax operator.
Current methods still face length generalisation challenges.
We propose an alternative attention mechanism based on the stick-breaking process (see the toy sketch after this list).
arXiv Detail & Related papers (2024-10-23T15:51:13Z)
- KPZ scaling from the Krylov space [83.88591755871734]
Recently, a superdiffusion exhibiting Kardar-Parisi-Zhang (KPZ) scaling in late-time correlators and autocorrelators has been reported.
Inspired by these results, we explore the KPZ scaling in correlation functions using their realization in the Krylov operator basis.
arXiv Detail & Related papers (2024-06-04T20:57:59Z)
- Causal Bandits with General Causal Models and Interventions [38.112806687145344]
This paper considers causal bandits (CBs) for the sequential design of interventions in a causal system.
The objective is to optimize a reward function via minimizing a measure of cumulative regret with respect to the best sequence of interventions in hindsight.
arXiv Detail & Related papers (2024-03-01T02:28:49Z)
- Some Fundamental Aspects about Lipschitz Continuity of Neural Networks [6.576051895863941]
Lipschitz continuity is a crucial functional property of any predictive model.
We examine and characterise the Lipschitz behaviour of Neural Networks.
We show a remarkable fidelity of the lower Lipschitz bound, identify a striking double-descent trend in both upper and lower bounds on the Lipschitz constant, and explain the intriguing effects of label noise on function smoothness and generalisation.
arXiv Detail & Related papers (2023-02-21T18:59:40Z)
- A Nearly Tight Bound for Fitting an Ellipsoid to Gaussian Random Points [50.90125395570797]
This nearly establishes a conjecture of Saunderson et al. (2012), within logarithmic factors.
The latter conjecture has attracted significant attention over the past decade, due to its connections to machine learning and sum-of-squares lower bounds for certain statistical problems.
arXiv Detail & Related papers (2022-12-21T17:48:01Z)
- There is no Accuracy-Interpretability Tradeoff in Reinforcement Learning for Mazes [64.05903267230467]
Interpretability is an essential building block for trustworthiness in reinforcement learning systems.
We show that in certain cases, one can achieve policy interpretability while maintaining its optimality.
arXiv Detail & Related papers (2022-06-09T04:23:26Z)
- A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove an $\Omega(\sqrt{n/p})$ lower bound on the Lipschitz constant of interpolating neural networks on arbitrary distributions.
We then show the potential benefit of overparametrization for smooth data when $n = \mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n = \exp(\omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z)
- Analytic Characterization of the Hessian in Shallow ReLU Models: A Tale of Symmetry [9.695960412426672]
We analytically characterize the Hessian at various families of spurious minima.
In particular, we prove that for $d \ge k$ standard Gaussian inputs: (a) of the $dk$ eigenvalues of the Hessian, $dk - O(d)$ concentrate near zero, and (b) $\Omega(d)$ of the eigenvalues grow linearly with $k$.
arXiv Detail & Related papers (2020-08-04T20:08:35Z)
- On Lipschitz Regularization of Convolutional Layers using Toeplitz Matrix Theory [77.18089185140767]
Lipschitz regularity is established as a key property of modern deep learning.
Computing the exact value of the Lipschitz constant of a neural network is known to be NP-hard.
We introduce a new upper bound for convolutional layers that is both tight and easy to compute (see the illustration after this list).
arXiv Detail & Related papers (2020-06-15T13:23:34Z)
- The Lipschitz Constant of Self-Attention [27.61634862685452]
Lipschitz constants of neural networks have been explored in various contexts in deep learning.
We investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling.
arXiv Detail & Related papers (2020-06-08T16:08:38Z)
- Learning Near Optimal Policies with Low Inherent Bellman Error [115.16037976819331]
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning.
We show that exploration is possible using only batch assumptions with an algorithm that achieves the optimal statistical rate for the setting we consider.
arXiv Detail & Related papers (2020-02-29T02:02:40Z)
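Toy sketch referenced in the Stick-breaking Attention entry above. It uses a generic stick-breaking construction with sigmoid break fractions, allocating attention mass from the most recent key backwards; this is an illustrative parameterization and may differ in detail from the paper's exact formulation.
```python
# Hedged sketch of stick-breaking attention weights for a single query:
# each key takes a sigmoid-sized fraction of the remaining "stick",
# starting from the most recent key, with no softmax involved.
import torch

torch.manual_seed(0)
n, d = 6, 8                                   # number of keys and head dimension
q = torch.randn(d)                            # query attending over keys 0..n-1
K = torch.randn(n, d)                         # keys; index n-1 is the most recent

beta = torch.sigmoid(K @ q / d**0.5)          # break fraction per key, in (0, 1)

weights = torch.zeros(n)
remaining = 1.0                               # stick mass not yet allocated
for j in reversed(range(n)):                  # most recent key first
    weights[j] = beta[j] * remaining
    remaining = remaining * (1.0 - beta[j])

print(weights, weights.sum())                 # non-negative, sums to at most 1
print(torch.softmax(K @ q / d**0.5, dim=0))   # softmax weights, for comparison
```
Unlike softmax, the stick-breaking weights need not sum to one; any leftover stick mass is simply dropped.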
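Illustration referenced in the entry on Lipschitz regularization of convolutional layers. It shows the classical circulant/Fourier fact that underlies such bounds, not the paper's algorithm: a single-channel convolution with circular padding is diagonalized by the 2-D DFT, so its spectral norm, and hence its Lipschitz constant, can be read off from the DFT of the zero-padded kernel. The kernel and input sizes are arbitrary choices.
```python
# Hedged sketch: spectral norm of a single-channel circular convolution via the FFT,
# cross-checked against a dense materialization of the same linear map.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
kernel = torch.randn(3, 3)
H = W = 8                                       # illustrative input size

padded = torch.zeros(H, W)
padded[:3, :3] = kernel
fft_norm = torch.fft.fft2(padded).abs().max()   # largest singular value via the DFT

def conv(x):                                    # x: (N, 1, H, W) -> (N, 1, H, W)
    return F.conv2d(F.pad(x, (1, 1, 1, 1), mode="circular"),
                    kernel.view(1, 1, 3, 3))

basis = torch.eye(H * W).view(H * W, 1, H, W)   # all standard basis images
dense = conv(basis).reshape(H * W, H * W)       # row i is the image of basis vector i
exact = torch.linalg.matrix_norm(dense, ord=2)  # same spectral norm as the operator

print(f"FFT value  : {fft_norm.item():.6f}")
print(f"dense check: {exact.item():.6f}")       # the two agree up to rounding
```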
This list is automatically generated from the titles and abstracts of the papers on this site.