On the Regularity of Attention
- URL: http://arxiv.org/abs/2102.05628v1
- Date: Wed, 10 Feb 2021 18:40:11 GMT
- Title: On the Regularity of Attention
- Authors: James Vuckovic, Aristide Baratin, Remi Tachet des Combes
- Abstract summary: We propose a new mathematical framework that uses measure theory and integral operators to model attention.
We show that this framework is consistent with the usual definition, and that it captures the essential properties of attention.
We also discuss the effects regularity can have on NLP models, and applications to invertible and infinitely-deep networks.
- Score: 11.703070372807293
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention is a powerful component of modern neural networks across a wide
variety of domains. In this paper, we seek to quantify the regularity (i.e. the
amount of smoothness) of the attention operation. To accomplish this goal, we
propose a new mathematical framework that uses measure theory and integral
operators to model attention. We show that this framework is consistent with
the usual definition, and that it captures the essential properties of
attention. Then we use this framework to prove that, on compact domains, the
attention operation is Lipschitz continuous and provide an estimate of its
Lipschitz constant. Additionally, by focusing on a specific type of attention,
we extend these Lipschitz continuity results to non-compact domains. We also
discuss the effects regularity can have on NLP models, and applications to
invertible and infinitely-deep networks.
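As an illustration of the quantity the paper bounds, the sketch below empirically lower-bounds the local Lipschitz constant of a single-head softmax attention map by sampling small random perturbations. This is not the paper's measure-theoretic estimate; the function and parameter names are illustrative assumptions.

```python
import numpy as np

def attention(x, Wq, Wk, Wv):
    """Single-head self-attention: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
n, d = 8, 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
x = rng.standard_normal((n, d))

# Crude lower bound on the Lipschitz constant of x -> attention(x):
# max over random perturbations of ||f(x + dx) - f(x)|| / ||dx||.
# Any true Lipschitz constant must be at least this large.
est = 0.0
fx = attention(x, Wq, Wk, Wv)
for _ in range(1000):
    dx = rng.standard_normal((n, d)) * 1e-3
    ratio = np.linalg.norm(attention(x + dx, Wq, Wk, Wv) - fx) / np.linalg.norm(dx)
    est = max(est, ratio)
print(f"empirical local Lipschitz lower bound: {est:.3f}")
```

Sampling gives only a lower bound at one point; the paper's contribution is an analytical upper bound valid over a whole (compact) domain.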
Related papers
- Localist LLMs -- A Mathematical Framework for Dynamic Locality Control [0.0]
The key innovation is a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining.
We prove that when group sparsity penalties exceed certain threshold values, the model's attention mechanisms concentrate on semantically relevant blocks, achieving low entropy and high fidelity with negligible error.
arXiv Detail & Related papers (2025-10-10T12:44:59Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Calibrating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation [51.14107156747967]
Weakly supervised semantic segmentation (WSSS) has attracted considerable attention because it requires fewer annotations than fully supervised approaches.
We propose an Adaptive Re-Activation Mechanism (AReAM) that calibrates deep-level attention to curb undisciplined over-smoothing.
AReAM substantially improves segmentation performance compared with existing WSSS methods, reducing noise while sharpening focus on relevant semantic regions.
arXiv Detail & Related papers (2023-05-04T19:11:33Z)
- Lipschitz Continuity Retained Binary Neural Network [52.17734681659175]
We introduce Lipschitz continuity as a rigorous criterion for defining the model robustness of BNNs.
We then propose to retain the Lipschitz continuity as a regularization term to improve the model robustness.
Our experiments prove that our BNN-specific regularization method can effectively strengthen the robustness of BNN.
arXiv Detail & Related papers (2022-07-13T22:55:04Z)
- Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any models with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
arXiv Detail & Related papers (2021-10-25T00:54:57Z)
- Coarse-Grained Smoothness for RL in Metric Spaces [13.837098609529257]
A common approach is to assume Lipschitz continuity of the Q-function.
We show that, unfortunately, this property fails to hold in many typical domains.
We propose a new coarse-grained smoothness definition that generalizes the notion of Lipschitz continuity.
arXiv Detail & Related papers (2021-10-23T18:53:56Z)
- Bayesian Attention Belief Networks [59.183311769616466]
Attention-based neural networks have achieved state-of-the-art results on a wide range of tasks.
This paper introduces Bayesian attention belief networks, which construct a decoder network by modeling unnormalized attention weights.
We show that our method outperforms deterministic attention and state-of-the-art attention in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
arXiv Detail & Related papers (2021-06-09T17:46:22Z)
- Deep Reinforced Attention Learning for Quality-Aware Visual Recognition [73.15276998621582]
We build upon the weakly-supervised generation mechanism of intermediate attention maps in any convolutional neural network.
We introduce a meta critic network to evaluate the quality of attention maps in the main network.
arXiv Detail & Related papers (2020-07-13T02:44:38Z)
- A Mathematical Theory of Attention [11.766912556907158]
We build a mathematically equivalent model of attention using measure theory.
We shed light on self-attention from a maximum entropy perspective.
We then apply these insights to the problem of mis-specified input data.
arXiv Detail & Related papers (2020-07-06T16:42:24Z)
- On Connections between Regularizations for Improving DNN Robustness [67.28077776415724]
This paper analyzes regularization terms proposed recently for improving the adversarial robustness of deep neural networks (DNNs).
We study possible connections between several effective methods, including input-gradient regularization, Jacobian regularization, curvature regularization, and a cross-Lipschitz functional.
arXiv Detail & Related papers (2020-07-04T23:43:32Z)
- The Lipschitz Constant of Self-Attention [27.61634862685452]
Lipschitz constants of neural networks have been explored in various contexts in deep learning.
We investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling.
arXiv Detail & Related papers (2020-06-08T16:08:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.