Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps
- URL: http://arxiv.org/abs/2302.00456v3
- Date: Mon, 15 Apr 2024 12:27:00 GMT
- Title: Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps
- Authors: Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui
- Abstract summary: We analyze the input contextualization effects of feed-forward (FF) blocks by rendering them in the attention maps as a human-friendly visualization scheme.
Our experiments with both masked- and causal-language models reveal that FF networks modify the input contextualization to emphasize specific types of linguistic compositions.
- Score: 25.854447287448828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are ubiquitous across a wide range of tasks. Interpreting their internals is a pivotal goal. Nevertheless, one particular component, the feed-forward (FF) block, has typically been less analyzed despite its substantial parameter count. We analyze the input contextualization effects of FF blocks by rendering them in the attention maps as a human-friendly visualization scheme. Our experiments with both masked- and causal-language models reveal that FF networks modify the input contextualization to emphasize specific types of linguistic compositions. In addition, FF and its surrounding components tend to cancel out each other's effects, suggesting potential redundancy in the processing of the Transformer layer.
Related papers
- Verb Conjugation in Transformers Is Determined by Linear Encodings of Subject Number [24.248659219487976]
We show that BERT's ability to conjugate verbs relies on a linear encoding of subject number.
This encoding is found in the subject position at the first layer and the verb position at the last layer, but distributed across positions at middle layers.
arXiv Detail & Related papers (2023-10-23T17:53:47Z)
- VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers [45.42482446288144]
Recent advances in interpretability suggest we can project weights and hidden states of transformer-based language models to their vocabulary.
We investigate LM attention heads and memory values, the vectors the models dynamically create and recall while processing a given input.
We create a tool to visualize a forward pass of Generative Pre-trained Transformers (GPTs) as an interactive flow graph.
arXiv Detail & Related papers (2023-05-22T19:04:56Z)
- Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively to baseline ViTs.
arXiv Detail & Related papers (2023-01-20T16:45:34Z)
- AttEntropy: Segmenting Unknown Objects in Complex Scenes using the Spatial Attention Entropy of Semantic Segmentation Transformers [99.22536338338011]
We study the spatial attentions of different backbone layers of semantic segmentation transformers.
We exploit this by extracting heatmaps that can be used to segment unknown objects within diverse backgrounds.
Our method is training-free and its computational overhead is negligible.
arXiv Detail & Related papers (2022-12-29T18:07:56Z)
- Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn the discriminative part features and explore correlations with a feature transformation module.
Our proposed approach does not rely on additional part branches and reaches state-of-the-art performance on 3 fine-grained object recognition benchmarks.
arXiv Detail & Related papers (2022-12-28T03:45:56Z)
- Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application [21.161850569358776]
Self-attention mechanisms have achieved great success in many fields such as computer vision and natural language processing.
Many existing vision transformer (ViT) works simply inherit transformer designs from NLP and adapt them to vision tasks.
This paper introduces a typical image processing technique, which maps low-level representations into mid-level spaces, and annotates extensive discrete keypoints with semantically rich information.
arXiv Detail & Related papers (2022-11-13T15:18:31Z)
- Improving Attention-Based Interpretability of Text Classification Transformers [7.027858121801477]
We study the effectiveness of attention-based interpretability techniques for transformers in text classification.
We show that, with proper setup, attention may be used in such tasks with results comparable to state-of-the-art techniques.
arXiv Detail & Related papers (2022-09-22T09:19:22Z)
- Deep Frequency Filtering for Domain Generalization [55.66498461438285]
Deep Neural Networks (DNNs) have preferences for some frequency components in the learning process.
We propose Deep Frequency Filtering (DFF) for learning domain-generalizable features.
We show that applying our proposed DFF on a plain baseline outperforms the state-of-the-art methods on different domain generalization tasks.
arXiv Detail & Related papers (2022-03-23T05:19:06Z)
- Incorporating Residual and Normalization Layers into Analysis of Masked Language Models [29.828669678974983]
We extend the scope of the analysis of Transformers from solely the attention patterns to the whole attention block.
Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed.
arXiv Detail & Related papers (2021-09-15T08:32:20Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Feature Pyramid Transformer [121.50066435635118]
We propose a fully active feature interaction across both space and scales, called Feature Pyramid Transformer (FPT).
FPT transforms any feature pyramid into another feature pyramid of the same size but with richer contexts.
We conduct extensive experiments in both instance-level (i.e., object detection and instance segmentation) and pixel-level segmentation tasks.
arXiv Detail & Related papers (2020-07-18T15:16:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.