Quantifying Attention Flow in Transformers
- URL: http://arxiv.org/abs/2005.00928v2
- Date: Sun, 31 May 2020 16:59:40 GMT
- Title: Quantifying Attention Flow in Transformers
- Authors: Samira Abnar and Willem Zuidema
- Abstract summary: "self-attention" combines information from attended embeddings into the representation of the focal embedding in the next layer.
This makes attention weights unreliable as explanation probes.
We propose two methods for approximating the attention to input tokens given attention weights, attention rollout and attention flow.
- Score: 12.197250533100283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the Transformer model, "self-attention" combines information from attended
embeddings into the representation of the focal embedding in the next layer.
Thus, across layers of the Transformer, information originating from different
tokens gets increasingly mixed. This makes attention weights unreliable as
explanation probes. In this paper, we consider the problem of quantifying this
flow of information through self-attention. We propose two methods for
approximating the attention to input tokens given attention weights, attention
rollout and attention flow, as post hoc methods when we use attention weights
as the relative relevance of the input tokens. We show that these methods give
complementary views on the flow of information, and compared to raw attention,
both yield higher correlations with importance scores of input tokens obtained
using an ablation method and input gradients.
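The rollout idea can be sketched compactly: average each layer's attention weights over heads, mix in the identity matrix to account for the residual connection, renormalize the rows, and multiply the resulting matrices from the first layer to the last. Below is a minimal NumPy sketch of that recipe; the function name attention_rollout and the residual_alpha parameter are illustrative choices rather than the authors' reference implementation, and attention flow (which instead treats the layered attention graph as a max-flow problem) is not shown.

```python
import numpy as np

def attention_rollout(attentions, residual_alpha=0.5):
    """Approximate attention to input tokens by composing per-layer attention.

    attentions: list of arrays of shape (num_heads, seq_len, seq_len),
                ordered from the first layer to the last, rows summing to 1.
    residual_alpha: weight on the identity matrix that stands in for the
                    residual connection (0.5 mixes them equally).
    """
    rollout = None
    for layer_attention in attentions:
        # Average the heads: (seq_len, seq_len).
        a = layer_attention.mean(axis=0)
        # Add the identity for the residual path, then renormalize the rows.
        a = residual_alpha * np.eye(a.shape[0]) + (1.0 - residual_alpha) * a
        a = a / a.sum(axis=-1, keepdims=True)
        # Compose with the rollout of all earlier layers.
        rollout = a if rollout is None else a @ rollout
    # rollout[i, j] approximates how much position i in the final layer
    # attends to input token j.
    return rollout

# Toy usage with random, row-normalized attention maps (4 layers, 8 heads, 6 tokens).
rng = np.random.default_rng(0)
maps = [rng.random((8, 6, 6)) for _ in range(4)]
maps = [m / m.sum(axis=-1, keepdims=True) for m in maps]
print(attention_rollout(maps).shape)  # (6, 6)
```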
Related papers
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- Elliptical Attention [1.7597562616011944]
Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision.
We propose using a Mahalanobis distance metric for computing the attention weights to stretch the underlying feature space in directions of high contextual relevance.
arXiv Detail & Related papers (2024-06-19T18:38:11Z)
- Generic Attention-model Explainability by Weighted Relevance Accumulation [9.816810016935541]
We propose a weighted relevancy strategy, which takes the importance of token values into consideration, to reduce distortion when equally accumulating relevance.
To evaluate our method, we propose a unified CLIP-based two-stage model, named CLIPmapper, to process Vision-and-Language tasks.
arXiv Detail & Related papers (2023-08-20T12:02:30Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones (a minimal sketch of this substitution appears after this list).
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
- Measuring the Mixing of Contextual Information in the Transformer [0.19116784879310028]
We consider the whole attention block (multi-head attention, residual connection, and layer normalization) and define a metric to measure token-to-token interactions.
Then, we aggregate layer-wise interpretations to provide input attribution scores for model predictions.
Experimentally, we show that our method, ALTI, provides faithful explanations and outperforms similar aggregation methods.
arXiv Detail & Related papers (2022-03-08T17:21:27Z)
- Is Sparse Attention more Interpretable? [52.85910570651047]
We investigate how sparsity affects our ability to use attention as an explainability tool.
We find that only a weak relationship exists between inputs and co-indexed intermediate representations under sparse attention.
We observe in this setting that inducing sparsity may make it less plausible that attention can be used as a tool for understanding model behavior.
arXiv Detail & Related papers (2021-06-02T11:42:56Z)
- Centroid Transformers: Learning to Abstract with Attention [15.506293166377182]
Self-attention is a powerful mechanism for extracting features from the inputs.
We propose centroid attention, a generalization of self-attention that maps $N$ inputs to $M$ outputs ($M \leq N$).
We apply our method to various applications, including abstractive text summarization, 3D vision, and image processing.
arXiv Detail & Related papers (2021-02-17T07:04:19Z)
- Transformer Interpretability Beyond Attention Visualization [87.96102461221415]
Self-attention techniques, and specifically Transformers, are dominating the field of text processing.
In this work, we propose a novel way to compute relevancy for Transformer networks.
arXiv Detail & Related papers (2020-12-17T18:56:33Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
- Telling BERT's full story: from Local Attention to Global Aggregation [14.92157586545743]
We take a deep look into the behavior of self-attention heads in the transformer architecture.
We show that attention distributions can nevertheless provide insights into the local behavior of attention heads.
arXiv Detail & Related papers (2020-04-10T01:36:41Z)
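As referenced in the PAPA entry above, that probe boils down to swapping the learned, input-dependent attention map for a matrix that is fixed in advance and observing how much the outputs (and ultimately task performance) change. The NumPy sketch below illustrates the substitution on toy data; the function names and the uniform choice of constant matrix are assumptions made for illustration, not the paper's exact setup.

```python
import numpy as np

def dot_product_attention(q, k, v):
    """Standard input-dependent attention: softmax(QK^T / sqrt(d)) V."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def constant_attention(v, const_weights):
    """Input-independent variant: the attention matrix is fixed in advance
    (passed in as const_weights), so it ignores the current queries and keys."""
    return const_weights @ v

# Toy comparison: replace the input-dependent attention pattern with a uniform
# constant matrix and measure how much the outputs change.
rng = np.random.default_rng(0)
n, d = 6, 16
q, k, v = rng.standard_normal((3, n, d))
uniform = np.full((n, n), 1.0 / n)  # one possible (assumed) constant choice
delta = np.abs(dot_product_attention(q, k, v) - constant_attention(v, uniform)).mean()
print(f"mean absolute change in outputs: {delta:.4f}")
```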
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.