Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps
- URL: http://arxiv.org/abs/2302.00456v3
- Date: Mon, 15 Apr 2024 12:27:00 GMT
- Title: Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps
- Authors: Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui
- Abstract summary: We analyze the input contextualization effects of feed-forward (FF) blocks by rendering them in the attention maps as a human-friendly visualization scheme.
Our experiments with both masked- and causal-language models reveal that FF networks modify the input contextualization to emphasize specific types of linguistic compositions.
- Score: 25.854447287448828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are ubiquitous across a wide range of tasks. Interpreting their internals is a pivotal goal. Nevertheless, one particular component, the feed-forward (FF) block, has typically been less analyzed despite its substantial share of the parameters. We analyze the input contextualization effects of FF blocks by rendering them in the attention maps as a human-friendly visualization scheme. Our experiments with both masked- and causal-language models reveal that FF networks modify the input contextualization to emphasize specific types of linguistic compositions. In addition, FF and its surrounding components tend to cancel out each other's effects, suggesting potential redundancy in the processing of the Transformer layer.
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding [17.855998090452058]
We propose an efficient and effective multi-task visual grounding framework based on Transformer Decoder.
In the language aspect, we employ the Transformer Decoder to fuse visual and linguistic features, where linguistic features are input as memory and visual features as queries.
In the visual aspect, we introduce a parameter-free approach to reduce computation by eliminating background visual tokens based on attention scores.
arXiv Detail & Related papers (2024-08-02T09:01:05Z)
- Verb Conjugation in Transformers Is Determined by Linear Encodings of Subject Number [24.248659219487976]
We show that BERT's ability to conjugate verbs relies on a linear encoding of subject number.
This encoding is found in the subject position at the first layer and the verb position at the last layer, but distributed across positions at middle layers.
arXiv Detail & Related papers (2023-10-23T17:53:47Z)
- Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively to baseline ViTs.
arXiv Detail & Related papers (2023-01-20T16:45:34Z)
- Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn the discriminative part features and explore correlations with a feature transformation module.
Our proposed approach does not rely on additional part branches and reaches state-of-the-art performance on 3-of-the-level object recognition.
arXiv Detail & Related papers (2022-12-28T03:45:56Z)
- Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application [21.161850569358776]
Self-attention mechanisms have achieved great success in many fields such as computer vision and natural language processing.
Many existing vision transformer (ViT) works simply inherit transformer designs from NLP and adapt them to vision tasks.
This paper introduces a typical image processing technique, which maps low-level representations into mid-level spaces, and annotates extensive discrete keypoints with semantically rich information.
arXiv Detail & Related papers (2022-11-13T15:18:31Z)
- Improving Attention-Based Interpretability of Text Classification Transformers [7.027858121801477]
We study the effectiveness of attention-based interpretability techniques for transformers in text classification.
We show that, with proper setup, attention may be used in such tasks with results comparable to state-of-the-art techniques.
arXiv Detail & Related papers (2022-09-22T09:19:22Z)
- Incorporating Residual and Normalization Layers into Analysis of Masked Language Models [29.828669678974983]
We extend the scope of the analysis of Transformers from solely the attention patterns to the whole attention block.
Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed.
arXiv Detail & Related papers (2021-09-15T08:32:20Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Feature Pyramid Transformer [121.50066435635118]
We propose a fully active feature interaction across both space and scales, called Feature Pyramid Transformer (FPT).
FPT transforms any feature pyramid into another feature pyramid of the same size but with richer contexts.
We conduct extensive experiments in both instance-level (i.e., object detection and instance segmentation) and pixel-level segmentation tasks.
arXiv Detail & Related papers (2020-07-18T15:16:32Z)