Quantifying Context Mixing in Transformers
- URL: http://arxiv.org/abs/2301.12971v1
- Date: Mon, 30 Jan 2023 15:19:02 GMT
- Title: Quantifying Context Mixing in Transformers
- Authors: Hosein Mohebbi, Willem Zuidema, Grzegorz Chrupała, Afra Alishahi
- Abstract summary: Self-attention weights and their transformed variants have been the main source of information for analyzing token-to-token interactions in Transformer-based models.
We propose Value Zeroing, a novel context mixing score customized for Transformers that provides us with a deeper understanding of how information is mixed at each encoder layer.
- Score: 13.98583981770322
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-attention weights and their transformed variants have been the main
source of information for analyzing token-to-token interactions in
Transformer-based models. But despite their ease of interpretation, these
weights are not faithful to the models' decisions as they are only one part of
an encoder, and other components in the encoder layer can have considerable
impact on information mixing in the output representations. In this work, by
expanding the scope of analysis to the whole encoder block, we propose Value
Zeroing, a novel context mixing score customized for Transformers that provides
us with a deeper understanding of how information is mixed at each encoder
layer. We demonstrate the superiority of our context mixing score over other
analysis methods through a series of complementary evaluations with different
viewpoints based on linguistically informed rationales, probing, and
faithfulness analysis.
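To make the idea concrete, below is a minimal NumPy sketch of Value Zeroing on a toy single-head encoder layer: each token's value vector is zeroed in turn, the layer is recomputed, and the cosine distance between the original and altered output representations is taken as the context mixing score. The simplified layer (no multi-head split, feed-forward sublayer, or layer normalization) and all parameter names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Value Zeroing idea: zero out one token's value vector,
# re-run the encoder layer, and measure how much every token's output changes.
# The toy single-head layer below is a simplification for illustration only.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 16                      # toy sequence length and hidden size
X = rng.normal(size=(seq_len, d))       # token representations entering the layer
W_q, W_k, W_v, W_o = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4))

def encoder_layer(X, zero_token=None):
    """One simplified self-attention block; optionally zero token j's value vector."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    if zero_token is not None:
        V = V.copy()
        V[zero_token] = 0.0             # the "value zeroing" intervention
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)  # softmax attention weights
    return X + (A @ V) @ W_o            # attention output plus residual connection

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

original = encoder_layer(X)
# C[i, j]: how much token i's output representation changes when token j's value is zeroed.
C = np.zeros((seq_len, seq_len))
for j in range(seq_len):
    ablated = encoder_layer(X, zero_token=j)
    for i in range(seq_len):
        C[i, j] = cosine_distance(original[i], ablated[i])
C /= C.sum(axis=-1, keepdims=True)      # row-normalize into a context mixing map
print(np.round(C, 3))
```

In this sketch, row i of the resulting matrix indicates how strongly each context token contributed to token i's output representation at that layer.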
Related papers
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z)
- Understanding Addition in Transformers [2.07180164747172]
This paper provides a comprehensive analysis of a one-layer Transformer model trained to perform n-digit integer addition.
Our findings suggest that the model dissects the task into parallel streams dedicated to individual digits, employing varied algorithms tailored to different positions within the digits.
arXiv Detail & Related papers (2023-10-19T19:34:42Z)
- DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers [6.405360669408265]
We propose a simple, new method to analyze encoder-decoder Transformers: DecoderLens.
Inspired by the LogitLens (for decoder-only Transformers), this method lets the decoder cross-attend to the representations of intermediate encoder layers.
We report results from the DecoderLens applied to models trained on question answering, logical reasoning, speech recognition and machine translation.
arXiv Detail & Related papers (2023-10-05T17:04:59Z)
- GlobEnc: Quantifying Global Token Attribution by Incorporating the Whole Encoder Layer in Transformers [19.642769560417904]
This paper introduces a novel token attribution analysis method that incorporates all the components in the encoder block and aggregates these attributions across layers.
Our experiments reveal that incorporating almost every encoder component results in increasingly more accurate analysis in both local and global settings.
arXiv Detail & Related papers (2022-05-06T15:13:34Z)
- Measuring the Mixing of Contextual Information in the Transformer [0.19116784879310028]
We consider the whole attention block (multi-head attention, residual connection, and layer normalization) and define a metric to measure token-to-token interactions.
Then, we aggregate layer-wise interpretations to provide input attribution scores for model predictions.
Experimentally, we show that our method, ALTI, provides faithful explanations and outperforms similar aggregation methods.
arXiv Detail & Related papers (2022-03-08T17:21:27Z)
- Incorporating Residual and Normalization Layers into Analysis of Masked Language Models [29.828669678974983]
We extend the scope of the analysis of Transformers from solely the attention patterns to the whole attention block.
Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed.
arXiv Detail & Related papers (2021-09-15T08:32:20Z)
- Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour on the learned representations and also the consequences of fixing it by introducing a notion of self consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z)
- Analyzing the Source and Target Contributions to Predictions in Neural Machine Translation [97.22768624862111]
We analyze NMT models with a method that explicitly evaluates the relative contributions of the source and target to the generation process.
We find that models trained with more data tend to rely more on source information and to have sharper token contributions.
arXiv Detail & Related papers (2020-10-21T11:37:27Z)
- On the Sub-Layer Functionalities of Transformer Decoder [74.83087937309266]
We study how Transformer-based decoders leverage information from the source and target languages.
Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance.
arXiv Detail & Related papers (2020-10-06T11:50:54Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed (non-learnable) attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
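For intuition on what the fixed, non-learnable attentive patterns described in the entry above can look like, here is a minimal NumPy sketch of position-based attention matrices (current, previous, and next token). The concrete pattern set and names are illustrative assumptions, not the exact heads used in that paper.

```python
# Fixed attention patterns: hard-coded attention matrices that depend only on
# token positions, not on the input, so they require no query/key parameters.
import numpy as np

def fixed_pattern(seq_len, offset):
    """Attention matrix where every token fully attends to the token at a fixed offset."""
    A = np.zeros((seq_len, seq_len))
    for i in range(seq_len):
        j = min(max(i + offset, 0), seq_len - 1)  # clamp at sequence boundaries
        A[i, j] = 1.0
    return A

seq_len = 5
patterns = {
    "current":  fixed_pattern(seq_len, 0),    # each token attends to itself
    "previous": fixed_pattern(seq_len, -1),   # ... to the token on its left
    "next":     fixed_pattern(seq_len, +1),   # ... to the token on its right
}
# A head using such a pattern only mixes value vectors through these constant
# weights, e.g. head output = A @ V.
V = np.arange(seq_len * 4, dtype=float).reshape(seq_len, 4)  # toy value vectors
for name, A in patterns.items():
    print(name, "\n", A @ V)
```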
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.