Incorporating Residual and Normalization Layers into Analysis of Masked Language Models
- URL: http://arxiv.org/abs/2109.07152v1
- Date: Wed, 15 Sep 2021 08:32:20 GMT
- Title: Incorporating Residual and Normalization Layers into Analysis of Masked Language Models
- Authors: Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui
- Abstract summary: We extend the scope of the analysis of Transformers from solely the attention patterns to the whole attention block.
Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed.
- Score: 29.828669678974983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer architecture has become ubiquitous in the natural
language processing field. To interpret Transformer-based models, their attention
patterns have been extensively analyzed. However, the Transformer architecture is
not composed solely of multi-head attention; other components can also contribute
to Transformers' strong performance. In this study, we extend the scope of the
analysis of Transformers from the attention patterns alone to the whole attention
block, i.e., multi-head attention, residual connection, and layer normalization.
Our analysis of Transformer-based masked language models shows that the
token-to-token interaction performed via attention has less impact on the
intermediate representations than previously assumed. These results provide new
intuitive explanations of existing reports; for example, discarding the learned
attention patterns tends not to adversely affect performance. The code for our
experiments is publicly available.
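To make the analyzed decomposition concrete, here is a minimal sketch of a norm-based contribution analysis that spans the whole attention block (attention, residual connection, and layer normalization). It assumes a single attention head with pre-computed attention weights, omits biases and LayerNorm's epsilon, and uses illustrative names; it is not the authors' released code.

```python
# Minimal, illustrative sketch (not the authors' released code) of decomposing
# one attention block's output into per-token contributions, covering
# attention, the residual connection, and layer normalization.
import torch

def attention_block_contributions(x, attn_weights, W_v, W_o, ln):
    """x: (seq_len, d) inputs; attn_weights: (seq_len, seq_len) softmaxed
    attention (single head for brevity); W_v, W_o: (d, d) value/output
    projections; ln: torch.nn.LayerNorm(d). Biases and LN's eps are omitted."""
    seq_len, d = x.shape
    # Token-wise transformed value: f(x_j) = W_o W_v x_j.
    f = x @ W_v.T @ W_o.T                                   # (seq_len, d)
    # Attention-mediated ("mixing") contribution of token j to position i.
    mixing = attn_weights.unsqueeze(-1) * f.unsqueeze(0)    # (seq_len, seq_len, d)
    # Residual connection: position i also receives x_i itself ("preserving").
    preserving = torch.zeros_like(mixing)
    preserving[torch.arange(seq_len), torch.arange(seq_len)] = x
    contrib = mixing + preserving                           # pre-LayerNorm contributions
    # LayerNorm distributes over the sum: centre each contribution and divide by
    # the standard deviation of the *summed* vector, then apply the learned gain.
    summed = contrib.sum(dim=1)                             # (seq_len, d)
    std = summed.std(dim=-1, unbiased=False, keepdim=True)  # (seq_len, 1)
    centred = contrib - contrib.mean(dim=-1, keepdim=True)
    contrib_ln = centred / std.unsqueeze(1) * ln.weight
    # Norm of each contribution: how strongly token j shapes position i's output.
    return contrib_ln.norm(dim=-1)                          # (seq_len, seq_len)
```

In such a decomposition, the diagonal entries correspond to the residual (token-preserving) contribution and the off-diagonal entries to attention-mediated mixing; comparing their norms is the kind of measurement behind the claim that attention-based mixing matters less than previously assumed.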
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
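As a rough illustration of this idea (an assumption-laden sketch, not the DAPE V2 implementation), one can treat the pre-softmax attention scores as a multi-channel feature map and refine them with a small convolution before the softmax; a causal model would additionally need masking, which is left out here.

```python
# Hedged sketch: post-process attention logits with a convolution across heads
# and neighbouring (query, key) positions, as one would a feature map in vision.
import torch
import torch.nn as nn

class ConvProcessedAttentionScores(nn.Module):
    def __init__(self, num_heads: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(num_heads, num_heads,
                              kernel_size=kernel_size, padding=kernel_size // 2)

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, heads, q_len, k_len) raw attention logits.
        refined = scores + self.conv(scores)    # residual refinement of the "feature map"
        return torch.softmax(refined, dim=-1)   # back to attention probabilities
```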
arXiv Detail & Related papers (2024-10-07T07:21:49Z) - Transformers need glasses! Information over-squashing in language tasks [18.81066657470662]
We study how information propagates in decoder-only Transformers.
We show that certain sequences of inputs to the Transformer can yield arbitrarily close representations in the final token.
We also show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input.
arXiv Detail & Related papers (2024-06-06T17:14:44Z) - Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI have made it possible to mitigate limitations by leveraging improved explanations for Transformers.
We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models.
Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z) - iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art on challenging real-world datasets.
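A hedged sketch of the inverted layout (illustrative only, not the released iTransformer code): each variate's entire series is embedded as a single token, so attention mixes information across variates rather than across time steps.

```python
import torch
import torch.nn as nn

class InvertedAttentionBlock(nn.Module):
    """Toy block illustrating 'inverted' attention for multivariate series."""
    def __init__(self, seq_len: int, d_model: int, num_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(seq_len, d_model)   # whole time series -> one token
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, variates) -> invert to (batch, variates, time).
        tokens = self.embed(x.transpose(1, 2))     # (batch, variates, d_model)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = tokens + attn_out                 # attention across variate tokens
        return tokens + self.ff(tokens)            # feed-forward applied per variate
```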
arXiv Detail & Related papers (2023-10-10T13:44:09Z) - A Meta-Learning Perspective on Transformers for Causal Language Modeling [17.293733942245154]
The Transformer architecture has become prominent in developing large causal language models.
We establish a meta-learning view of the Transformer architecture when trained for the causal language modeling task.
Within the inner optimization, we discover and theoretically analyze a special characteristic of the norms of learned token representations within Transformer-based causal language models.
arXiv Detail & Related papers (2023-10-09T17:27:36Z) - VISIT: Visualizing and Interpreting the Semantic Information Flow of
Transformers [45.42482446288144]
Recent advances in interpretability suggest we can project the weights and hidden states of transformer-based language models into their vocabulary space.
We investigate LM attention heads and memory values, the vectors the models dynamically create and recall while processing a given input.
We create a tool to visualize a forward pass of Generative Pre-trained Transformers (GPTs) as an interactive flow graph.
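The projection itself can be sketched in a few lines of a logit-lens style probe (an illustration of the idea only, not the VISIT tool, which builds an interactive flow graph on top of such projections):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

layer, position = 6, -1                                  # an intermediate layer, last token
hidden = out.hidden_states[layer][0, position]           # (d_model,) hidden state
logits = model.lm_head(model.transformer.ln_f(hidden))   # project into vocabulary space
print(tokenizer.convert_ids_to_tokens(logits.topk(5).indices.tolist()))
```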
arXiv Detail & Related papers (2023-05-22T19:04:56Z) - Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps [25.854447287448828]
We analyze the input contextualization effects of feed-forward (FF) blocks by rendering them in the attention maps as a human-friendly visualization scheme.
Our experiments with both masked- and causal-language models reveal that FF networks modify the input contextualization to emphasize specific types of linguistic compositions.
arXiv Detail & Related papers (2023-02-01T13:59:47Z) - Quantifying Context Mixing in Transformers [13.98583981770322]
Self-attention weights and their transformed variants have been the main source of information for analyzing token-to-token interactions in Transformer-based models.
We propose Value Zeroing, a novel context mixing score customized for Transformers that provides us with a deeper understanding of how information is mixed at each encoder layer.
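Conceptually, such a score can be sketched as follows (a simplified sketch under assumed names, restricted to a single head's attention output rather than the full encoder layer the authors ablate): zero one token's value vector and measure how much every other position's output moves.

```python
import torch
import torch.nn.functional as F

def value_zeroing_scores(attn_weights: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """attn_weights: (seq_len, seq_len) softmaxed attention for one head;
    values: (seq_len, d) value vectors. Entry (i, j) of the result says how much
    position i's attention output changes when token j's value is zeroed."""
    original = attn_weights @ values                       # (seq_len, d)
    seq_len = values.shape[0]
    scores = torch.zeros(seq_len, seq_len)
    for j in range(seq_len):
        ablated_values = values.clone()
        ablated_values[j] = 0.0                            # zero out token j's value
        ablated = attn_weights @ ablated_values
        # Cosine distance between original and ablated outputs at every position.
        scores[:, j] = 1.0 - F.cosine_similarity(original, ablated, dim=-1)
    return scores
```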
arXiv Detail & Related papers (2023-01-30T15:19:02Z) - Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively to baseline ViTs.
arXiv Detail & Related papers (2023-01-20T16:45:34Z) - SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can be also applied in guidance of SparseBERT design.
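A hedged sketch of what a differentiable attention mask can look like (assumed relaxation details, not the DAM reference code): a learnable gate per (query, key) position is relaxed with a sigmoid, multiplied into the attention weights, and pushed towards sparsity by a penalty term added to the training loss.

```python
import torch
import torch.nn as nn

class DifferentiableAttentionMask(nn.Module):
    def __init__(self, seq_len: int, temperature: float = 0.1):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(seq_len, seq_len))  # learnable gate logits
        self.temperature = temperature

    def forward(self, attn_probs: torch.Tensor) -> torch.Tensor:
        # attn_probs: (batch, heads, seq_len, seq_len) softmaxed attention.
        gate = torch.sigmoid(self.logits / self.temperature)       # soft 0/1 mask
        masked = attn_probs * gate                                  # gate each connection
        return masked / masked.sum(dim=-1, keepdim=True).clamp_min(1e-9)

    def sparsity_penalty(self) -> torch.Tensor:
        # Pushes most gates towards zero, i.e. towards a sparse attention pattern.
        return torch.sigmoid(self.logits / self.temperature).mean()
```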
arXiv Detail & Related papers (2021-02-25T14:13:44Z) - Fixed Encoder Self-Attention Patterns in Transformer-Based Machine
Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
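The fixed patterns can be as simple as hard-coded relative positions; a toy sketch (assumed details, not the paper's code):

```python
import torch

def fixed_attention_pattern(seq_len: int, offset: int) -> torch.Tensor:
    """(seq_len, seq_len) attention matrix in which position i attends entirely
    to position i + offset, clipped to the sequence boundary."""
    targets = (torch.arange(seq_len) + offset).clamp(0, seq_len - 1)
    pattern = torch.zeros(seq_len, seq_len)
    pattern[torch.arange(seq_len), targets] = 1.0
    return pattern

# One non-learnable pattern per head, e.g. previous, current, and next token;
# in the paper's setting only one head per encoder layer keeps learned attention.
heads = [fixed_attention_pattern(8, off) for off in (-1, 0, 1)]
```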
arXiv Detail & Related papers (2020-02-24T13:53:06Z)