A Mathematical Theory of Attention
- URL: http://arxiv.org/abs/2007.02876v2
- Date: Mon, 20 Jul 2020 13:57:49 GMT
- Title: A Mathematical Theory of Attention
- Authors: James Vuckovic, Aristide Baratin, Remi Tachet des Combes
- Abstract summary: We build a mathematically equivalent model of attention using measure theory.
We shed light on self-attention from a maximum entropy perspective.
We then apply these insights to the problem of mis-specified input data.
- Score: 11.766912556907158
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention is a powerful component of modern neural networks across a wide
variety of domains. However, despite its ubiquity in machine learning, there is
a gap in our understanding of attention from a theoretical point of view. We
propose a framework to fill this gap by building a mathematically equivalent
model of attention using measure theory. With this model, we are able to
interpret self-attention as a system of self-interacting particles, we shed
light on self-attention from a maximum entropy perspective, and we show that
attention is actually Lipschitz-continuous (with an appropriate metric) under
suitable assumptions. We then apply these insights to the problem of
mis-specified input data; infinitely-deep, weight-sharing self-attention
networks; and more general Lipschitz estimates for a specific type of attention
studied in concurrent work.
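To make the measure-theoretic reading concrete, one standard way to write attention is as an integral operator acting on an input measure $\mu$ with a softmax (maximum-entropy) kernel. This is a hedged sketch of the general idea, not necessarily the paper's exact construction:

```latex
% Attention on an input measure \mu with a softmax (maximum-entropy) kernel.
% For the empirical measure \mu = (1/n) \sum_j \delta_{x_j}, this reduces to
% ordinary softmax self-attention over tokens x_1, ..., x_n.
\mathrm{Att}(x;\mu)
  = \int V y \,
    \frac{\exp\!\big(\langle Q x,\, K y\rangle\big)}
         {\int \exp\!\big(\langle Q x,\, K y'\rangle\big)\, d\mu(y')}
    \, d\mu(y)
```

Under that reading, weight-shared self-attention applies the same map to every particle (token) using the empirical measure of all particles, which is the self-interacting particle system mentioned in the abstract; iterating the shared step is one way to think about the infinitely-deep, weight-sharing networks it refers to. Below is a minimal NumPy sketch of this view; the names and parameters (`Q`, `K`, `V`, `n_layers`) are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def self_attention_step(X, Q, K, V):
    """One weight-shared self-attention update on particles X (n tokens x d dims).

    Each row of X is a particle; the update replaces particle i with a
    softmax-weighted average of V x_j taken over the empirical measure of
    all particles, i.e. one step of a self-interacting particle system.
    """
    scores = (X @ Q.T) @ (X @ K.T).T / np.sqrt(Q.shape[0])  # pairwise interaction scores
    scores -= scores.max(axis=1, keepdims=True)             # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)            # softmax (max-entropy) kernel
    return weights @ (X @ V.T)                               # interaction step

# Iterate the same (weight-shared) step to mimic a deep weight-sharing network;
# dimensions and layer count are arbitrary, for illustration only.
rng = np.random.default_rng(0)
d, n, n_layers = 8, 16, 4
Q, K, V = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
X = rng.normal(size=(n, d))
for _ in range(n_layers):
    X = self_attention_step(X, Q, K, V)
```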
Related papers
- A Statistical Theory of Gated Attention through the Lens of Hierarchical Mixture of Experts [80.98474052840929]
Gated attention has been empirically demonstrated to increase the expressiveness of low-rank mapping in standard attention. We show that each entry in a gated attention matrix or a multi-head self-attention matrix can be written as a hierarchical mixture of experts.
arXiv Detail & Related papers (2026-02-01T22:22:13Z)
- Towards understanding how attention mechanism works in deep learning [8.79364699260219]
We study the process of computing similarity using classic metrics and vector space properties in manifold learning, clustering, and supervised learning.
We decompose the self-attention mechanism into a learnable pseudo-metric function and an information propagation process based on similarity computation.
We propose a modified attention mechanism called metric-attention by leveraging the concept of metric learning to facilitate the ability to learn desired metrics more effectively.
arXiv Detail & Related papers (2024-12-24T08:52:06Z)
- Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers [14.59741397670484]
We consider a deep multi-head self-attention network, that is closely related to Transformers yet analytically tractable.
We develop a statistical mechanics theory of Bayesian learning in this model.
Experiments confirm our findings on both synthetic and real-world sequence classification tasks.
arXiv Detail & Related papers (2024-05-24T20:34:18Z)
- Guiding Visual Question Answering with Attention Priors [76.21671164766073]
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z)
- Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any models with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
arXiv Detail & Related papers (2021-10-25T00:54:57Z)
- Bayesian Attention Belief Networks [59.183311769616466]
Attention-based neural networks have achieved state-of-the-art results on a wide range of tasks.
This paper introduces Bayesian attention belief networks, which construct a decoder network by modeling unnormalized attention weights.
We show that our method outperforms deterministic attention and state-of-the-art attention in accuracy, uncertainty estimation, generalization across domains, and adversarial attacks.
arXiv Detail & Related papers (2021-06-09T17:46:22Z)
- Is Sparse Attention more Interpretable? [52.85910570651047]
We investigate how sparsity affects our ability to use attention as an explainability tool.
We find that only a weak relationship exists between inputs and co-indexed intermediate representations under sparse attention.
We observe in this setting that inducing sparsity may make it less plausible that attention can be used as a tool for understanding model behavior.
arXiv Detail & Related papers (2021-06-02T11:42:56Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can be also applied in guidance of SparseBERT design.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- On the Regularity of Attention [11.703070372807293]
We propose a new mathematical framework that uses measure theory and integral operators to model attention.
We show that this framework is consistent with the usual definition, and that it captures the essential properties of attention.
We also discuss the effects that regularity can have on NLP models, along with applications to invertible and infinitely-deep networks.
arXiv Detail & Related papers (2021-02-10T18:40:11Z)
- Focus of Attention Improves Information Transfer in Visual Features [80.22965663534556]
This paper focuses on unsupervised learning for transferring visual information in a truly online setting.
The entropy terms are computed by a temporal process that yields online estimates of their values.
In order to better structure the input probability distribution, we use a human-like focus of attention model.
arXiv Detail & Related papers (2020-06-16T15:07:25Z)