Pay Attention to Attention Distribution: A New Local Lipschitz Bound for Transformers
- URL: http://arxiv.org/abs/2507.07814v1
- Date: Thu, 10 Jul 2025 14:45:31 GMT
- Title: Pay Attention to Attention Distribution: A New Local Lipschitz Bound for Transformers
- Authors: Nikolay Yudin, Alexander Gaponov, Sergei Kudriashov, Maxim Rakhuba
- Abstract summary: We present a novel local Lipschitz bound for self-attention blocks of transformers. We suggest an explanation of the way distributions inside the attention map affect the robustness from the Lipschitz constant perspective.
- Score: 41.94295877935867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel local Lipschitz bound for self-attention blocks of transformers. This bound is based on a refined closed-form expression for the spectral norm of the softmax function. The resulting bound is not only more accurate than in the prior art, but also unveils the dependence of the Lipschitz constant on attention score maps. Based on the new findings, we suggest an explanation of the way distributions inside the attention map affect the robustness from the Lipschitz constant perspective. We also introduce a new lightweight regularization term called JaSMin (Jacobian Softmax norm Minimization), which boosts the transformer's robustness and decreases local Lipschitz constants of the whole network.
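The abstract ties the Lipschitz constant to the spectral norm of the softmax Jacobian and to how attention mass is distributed. The sketch below only illustrates that dependence and does not reproduce the paper's bound or the JaSMin regularizer. It uses the standard identity that the Jacobian of softmax at output p is diag(p) - p p^T and computes its spectral norm numerically; the function names are hypothetical and NumPy is an assumed dependency.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax of a 1-D score vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(p):
    # Jacobian of softmax evaluated at output distribution p: diag(p) - p p^T.
    return np.diag(p) - np.outer(p, p)

def jacobian_spectral_norm(z):
    # Largest singular value of the softmax Jacobian at scores z.
    p = softmax(np.asarray(z, dtype=float))
    return np.linalg.norm(softmax_jacobian(p), ord=2)

# Three illustrative attention rows over four tokens (hypothetical scores).
uniform = jacobian_spectral_norm(np.zeros(4))               # p = (1/4, 1/4, 1/4, 1/4)
two_way = jacobian_spectral_norm([50.0, 50.0, 0.0, 0.0])    # p close to (1/2, 1/2, 0, 0)
peaked = jacobian_spectral_norm([50.0, 0.0, 0.0, 0.0])      # p close to one-hot

print(f"uniform: {uniform:.4f}")   # about 0.25
print(f"two-way: {two_way:.4f}")   # about 0.50
print(f"peaked:  {peaked:.4f}")    # about 0.00
```

In this toy setting a row split evenly between two tokens has the largest Jacobian norm (1/2), a uniform row over four tokens has norm 1/4, and a nearly one-hot row has norm close to zero, which is one concrete way the attention distribution can change local sensitivity.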
Related papers
- MIQCQP reformulation of the ReLU neural networks Lipschitz constant
estimation problem [0.0]
We propose new quadratically constrained MIP formulations for the neural network Lipschitz estimation problem.
The solutions of these problems give lower bounds and upper bounds of the Lipschitz constant.
We detail conditions when they coincide with the exact Lipschitz constant.
arXiv Detail & Related papers (2024-02-02T07:55:42Z) - Novel Quadratic Constraints for Extending LipSDP beyond Slope-Restricted
Activations [52.031701581294804]
Lipschitz bounds for neural networks can be computed with upper time preservation guarantees.
Our paper bridges the gap and extends LipSDP beyond slope-restricted activation functions.
Our proposed analysis is general and provides a unified approach for estimating $\ell_2$ and $\ell_\infty$ Lipschitz bounds.
arXiv Detail & Related papers (2024-01-25T09:23:31Z) - Some Fundamental Aspects about Lipschitz Continuity of Neural Networks [6.576051895863941]
Lipschitz continuity is a crucial functional property of any predictive model.
We examine and characterise the Lipschitz behaviour of Neural Networks.
We show a remarkable fidelity of the lower Lipschitz bound, identify a striking Double Descent trend in both upper and lower bounds to the Lipschitz constant, and explain the intriguing effects of label noise on function smoothness and generalisation.
arXiv Detail & Related papers (2023-02-21T18:59:40Z) - Efficiently Computing Local Lipschitz Constants of Neural Networks via
Bound Propagation [79.13041340708395]
Lipschitz constants are connected to many properties of neural networks, such as robustness, fairness, and generalization.
Existing methods for computing Lipschitz constants either produce relatively loose upper bounds or are limited to small networks.
We develop an efficient framework for computing the $\ell_\infty$ local Lipschitz constant of a neural network by tightly upper bounding the norm of the Clarke Jacobian.
arXiv Detail & Related papers (2022-10-13T22:23:22Z) - Chordal Sparsity for Lipschitz Constant Estimation of Deep Neural
Networks [77.82638674792292]
Lipschitz constants of neural networks allow for guarantees of robustness in image classification, safety in controller design, and generalizability beyond the training data.
As calculating Lipschitz constants is NP-hard, techniques for estimating Lipschitz constants must navigate the trade-off between scalability and accuracy.
In this work, we significantly push the scalability frontier of a semidefinite programming technique known as LipSDP while achieving zero accuracy loss.
arXiv Detail & Related papers (2022-04-02T11:57:52Z) - On Lipschitz Regularization of Convolutional Layers using Toeplitz
Matrix Theory [77.18089185140767]
Lipschitz regularity is established as a key property of modern deep learning.
Computing the exact value of the Lipschitz constant of a neural network is known to be NP-hard.
We introduce a new upper bound for convolutional layers that is both tight and easy to compute.
arXiv Detail & Related papers (2020-06-15T13:23:34Z) - The Lipschitz Constant of Self-Attention [27.61634862685452]
Lipschitz constants of neural networks have been explored in various contexts in deep learning.
We investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling.
arXiv Detail & Related papers (2020-06-08T16:08:38Z) - Exactly Computing the Local Lipschitz Constant of ReLU Networks [98.43114280459271]
The local Lipschitz constant of a neural network is a useful metric for robustness, generalization, and fairness evaluation.
We show strong inapproximability results for estimating Lipschitz constants of ReLU networks.
We leverage this algorithm to evaluate the tightness of competing Lipschitz estimators and the effects of regularized training on the Lipschitz constant.
arXiv Detail & Related papers (2020-03-02T22:15:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.