Measuring the Mixing of Contextual Information in the Transformer
- URL: http://arxiv.org/abs/2203.04212v1
- Date: Tue, 8 Mar 2022 17:21:27 GMT
- Title: Measuring the Mixing of Contextual Information in the Transformer
- Authors: Javier Ferrando, Gerard I. Gállego, and Marta R. Costa-jussà
- Abstract summary: We consider the whole attention block --multi-head attention, residual connection, and layer normalization-- and define a metric to measure token-to-token interactions.
Then, we aggregate layer-wise interpretations to provide input attribution scores for model predictions.
Experimentally, we show that our method, ALTI, provides faithful explanations and outperforms similar aggregation methods.
- Score: 0.19116784879310028
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer architecture aggregates input information through the
self-attention mechanism, but there is no clear understanding of how this
information is mixed across the entire model. Additionally, recent works have
demonstrated that attention weights alone are not enough to describe the flow
of information. In this paper, we consider the whole attention block
--multi-head attention, residual connection, and layer normalization-- and
define a metric to measure token-to-token interactions within each layer,
considering the characteristics of the representation space. Then, we aggregate
layer-wise interpretations to provide input attribution scores for model
predictions. Experimentally, we show that our method, ALTI (Aggregation of
Layer-wise Token-to-token Interactions), provides faithful explanations and
outperforms similar aggregation methods.
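To make the aggregation step concrete, below is a minimal sketch of how layer-wise token-to-token contribution matrices could be chained into input attribution scores. The per-layer matrices are assumed to be given (in ALTI they are derived from the full attention block rather than from raw attention weights); the function name and the random toy data are illustrative only.

```python
import numpy as np

def aggregate_layerwise(contributions):
    """Chain per-layer token-to-token contribution matrices.

    contributions: list of (seq_len, seq_len) arrays, one per layer;
    row i holds how much each layer-input token contributes to
    layer-output token i (rows assumed to sum to 1).

    Returns a (seq_len, seq_len) matrix whose row i gives the
    attribution of each model-input token to output position i.
    """
    global_contrib = contributions[0]
    for layer_contrib in contributions[1:]:
        # Composing layers: contributions to this layer's inputs are
        # redistributed back to the original input tokens.
        global_contrib = layer_contrib @ global_contrib
    return global_contrib

# Toy usage with random, row-normalized matrices standing in for
# the real per-layer contributions.
rng = np.random.default_rng(0)
layers = []
for _ in range(6):
    m = rng.random((5, 5))
    layers.append(m / m.sum(axis=1, keepdims=True))
attributions = aggregate_layerwise(layers)
print(attributions.sum(axis=1))  # each row still sums to ~1
```

Because each per-layer matrix is row-stochastic, the chained product stays row-stochastic, so every output position's attributions over the input tokens sum to one.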
Related papers
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform in-context learning (ICL) on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
- Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence [51.54175067684008]
This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks.
We first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes.
Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.
arXiv Detail & Related papers (2024-03-17T07:02:55Z)
- Quantifying Context Mixing in Transformers [13.98583981770322]
Self-attention weights and their transformed variants have been the main source of information for analyzing token-to-token interactions in Transformer-based models.
We propose Value Zeroing, a novel context mixing score customized for Transformers that provides us with a deeper understanding of how information is mixed at each encoder layer.
arXiv Detail & Related papers (2023-01-30T15:19:02Z)
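The value-zeroing idea summarized above can be illustrated with a self-contained toy: a single attention head in NumPy, where each token's value vector is zeroed in turn and the resulting change in every other token's output is measured with cosine distance. The real method operates on the full encoder layers of a trained model (including residual connections and layer normalization); all names and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def value_zeroing_scores(X, Wq, Wk, Wv):
    """Toy value-zeroing style score: zero token j's value vector,
    recompute the attention output, and record how much every
    token i's representation changes (1 - cosine similarity)."""
    n = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # attention depends only on Q, K
    base = A @ V                                 # unmodified outputs
    scores = np.zeros((n, n))
    for j in range(n):
        V_zeroed = V.copy()
        V_zeroed[j] = 0.0                        # knock out token j's value
        altered = A @ V_zeroed
        for i in range(n):
            denom = np.linalg.norm(base[i]) * np.linalg.norm(altered[i]) + 1e-9
            scores[i, j] = 1.0 - np.dot(base[i], altered[i]) / denom
    return scores                                # scores[i, j]: how much i relied on j

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                      # 4 tokens, dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(value_zeroing_scores(X, Wq, Wk, Wv).round(3))
```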
- Integrative Feature and Cost Aggregation with Transformers for Dense Correspondence [63.868905184847954]
The current state of the art comprises Transformer-based approaches that focus on either feature descriptors or cost volume aggregation.
We propose a novel Transformer-based network that interleaves both forms of aggregations in a way that exploits their complementary information.
We evaluate the effectiveness of the proposed method on dense matching tasks and achieve state-of-the-art performance on all the major benchmarks.
arXiv Detail & Related papers (2022-09-19T03:33:35Z)
- Augmenting Convolutional networks with attention-based aggregation [55.97184767391253]
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning.
We pair this learned aggregation layer with a simple patch-based convolutional network parametrized by only two parameters (width and depth).
It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption.
arXiv Detail & Related papers (2021-12-27T14:05:41Z)
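The learned aggregation layer mentioned above can be sketched as attention-based pooling: a single trainable query scores every patch feature, and the image descriptor is the attention-weighted sum, replacing global average pooling. This is a rough sketch of the general mechanism; names and shapes are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(patch_features, query):
    """Attention-based global aggregation: one learned query produces a
    weight per patch, and the descriptor is the weighted sum of patches."""
    d = patch_features.shape[-1]
    weights = softmax(patch_features @ query / np.sqrt(d))  # (num_patches,)
    pooled = weights @ patch_features                       # (d,)
    return pooled, weights

rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 256))   # e.g. a 7x7 grid of patch embeddings
query = rng.normal(size=256)           # would be learned during training
descriptor, attn_map = attention_pool(patches, query)
print(descriptor.shape, attn_map.sum())  # (256,) 1.0 -- the map is a distribution
```

The attention map doubles as a built-in visualization of which patches drive the prediction, which is part of the appeal of this kind of aggregation layer.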
- Rethinking Global Context in Crowd Counting [70.54184500538338]
A pure transformer is used to extract features with global information from overlapping image patches.
Inspired by classification, we add a context token to the input sequence, to facilitate information exchange with tokens corresponding to image patches.
arXiv Detail & Related papers (2021-05-23T12:44:27Z)
- Quantifying Attention Flow in Transformers [12.197250533100283]
"self-attention" combines information from attended embeddings into the representation of the focal embedding in the next layer.
This makes attention weights unreliable as explanations probes.
We propose two methods for approximating the attention to input tokens given attention weights, attention rollout and attention flow.
arXiv Detail & Related papers (2020-05-02T21:45:27Z)
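Attention rollout is simple enough to sketch directly: multiply head-averaged attention matrices across layers, mixing in the identity matrix to approximate residual connections and re-normalizing the rows. Attention flow, which casts the same question as a max-flow problem, is more involved and omitted here. The helper below is a sketch under those assumptions, not the authors' code.

```python
import numpy as np

def attention_rollout(attentions, residual_alpha=0.5):
    """Attention rollout: recursively multiply per-layer attention
    matrices, blending in the identity to account for residual
    connections and re-normalizing the rows.

    attentions: list of (seq_len, seq_len) head-averaged attention
    matrices, ordered from the first layer to the last.
    """
    n = attentions[0].shape[0]
    rollout = np.eye(n)
    for A in attentions:
        A_res = residual_alpha * A + (1 - residual_alpha) * np.eye(n)
        A_res = A_res / A_res.sum(axis=-1, keepdims=True)
        rollout = A_res @ rollout
    return rollout  # row i: estimated attention of position i to each input token

# Toy usage with random row-stochastic matrices in place of real attention.
rng = np.random.default_rng(0)
attns = []
for _ in range(4):
    a = rng.random((6, 6))
    attns.append(a / a.sum(axis=-1, keepdims=True))
print(attention_rollout(attns).round(3))
```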
- Self-Attention Attribution: Interpreting Information Interactions Inside Transformer [89.21584915290319]
We propose a self-attention attribution method to interpret the information interactions inside the Transformer.
We show that the attribution results can be used as adversarial patterns to implement non-targeted attacks against BERT.
arXiv Detail & Related papers (2020-04-23T14:58:22Z)
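The underlying attribution method applies integrated gradients to the attention weights. The sketch below uses a toy single-head block and an arbitrary scalar output standing in for the model prediction; it illustrates the integration over scaled attention matrices only, and is not the authors' implementation.

```python
import torch

torch.manual_seed(0)
n, d = 5, 16
X = torch.randn(n, d)
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
w_out = torch.randn(d)

def output_given_attention(A):
    """Toy single-head block: mix value vectors with a given attention
    matrix and reduce to a scalar standing in for the model prediction."""
    return ((A @ (X @ Wv)) @ w_out).sum()

# Base attention matrix from the usual scaled dot-product softmax.
with torch.no_grad():
    scores = (X @ Wq) @ (X @ Wk).T / d ** 0.5
    A_base = torch.softmax(scores, dim=-1)

# Integrated gradients along the straight path alpha * A_base, alpha in (0, 1].
steps = 20
grad_sum = torch.zeros_like(A_base)
for k in range(1, steps + 1):
    A = (k / steps * A_base).clone().requires_grad_(True)
    output_given_attention(A).backward()
    grad_sum += A.grad
attribution = A_base * grad_sum / steps  # entry (i, j): importance of the i -> j attention edge
print(attribution)
```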
- Telling BERT's full story: from Local Attention to Global Aggregation [14.92157586545743]
We take a deep look into the behavior of self-attention heads in the transformer architecture.
We show that attention distributions can nevertheless provide insights into the local behavior of attention heads.
arXiv Detail & Related papers (2020-04-10T01:36:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.