Telling BERT's full story: from Local Attention to Global Aggregation
- URL: http://arxiv.org/abs/2004.05916v2
- Date: Wed, 13 Jan 2021 21:48:04 GMT
- Title: Telling BERT's full story: from Local Attention to Global Aggregation
- Authors: Damian Pascual, Gino Brunner and Roger Wattenhofer
- Abstract summary: We take a deep look into the behavior of self-attention heads in the transformer architecture.
We show that attention distributions can nevertheless provide insights into the local behavior of attention heads.
- Score: 14.92157586545743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We take a deep look into the behavior of self-attention heads in the
transformer architecture. In light of recent work discouraging the use of
attention distributions for explaining a model's behavior, we show that
attention distributions can nevertheless provide insights into the local
behavior of attention heads. This way, we propose a distinction between local
patterns revealed by attention and global patterns that refer back to the
input, and analyze BERT from both angles. We use gradient attribution to
analyze how the output of an attention head depends on the input
tokens, effectively extending the local attention-based analysis to account for
the mixing of information throughout the transformer layers. We find that there
is a significant discrepancy between attention and attribution distributions,
caused by the mixing of context inside the model. We quantify this discrepancy
and observe that interestingly, there are some patterns that persist across all
layers despite the mixing.
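As a rough illustration of the kind of gradient attribution described above, the sketch below measures how the output of a single attention head in a Hugging Face BERT model depends on the input token embeddings. It is not the authors' code: the layer, head, and position indices are arbitrary, and slicing the self-attention output per head (before the output projection) is a simplification.

```python
# Minimal sketch (not the paper's implementation): gradient attribution of one
# attention-head output with respect to the input token embeddings.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")

# Use explicit input embeddings so gradients can flow back to each token.
embeds = model.embeddings.word_embeddings(inputs["input_ids"]).detach()
embeds.requires_grad_(True)

# Capture the per-head context vectors (heads concatenated, before the
# output projection) with a forward hook; indices below are illustrative.
layer, head, position = 5, 3, 1
captured = {}
def hook(module, inp, out):
    captured["ctx"] = out[0]    # (batch, seq_len, num_heads * head_dim)
handle = model.encoder.layer[layer].attention.self.register_forward_hook(hook)

model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"])
handle.remove()

head_dim = model.config.hidden_size // model.config.num_attention_heads
head_out = captured["ctx"][0, position, head * head_dim:(head + 1) * head_dim]

# Attribution of each input token = gradient norm of the head output
# with respect to that token's embedding.
head_out.norm().backward()
attribution = embeds.grad[0].norm(dim=-1)
print(attribution / attribution.sum())
```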
Related papers
- Measuring the Mixing of Contextual Information in the Transformer [0.19116784879310028]
We consider the whole attention block (multi-head attention, residual connection, and layer normalization) and define a metric to measure token-to-token interactions.
Then, we aggregate layer-wise interpretations to provide input attribution scores for model predictions.
Experimentally, we show that our method, ALTI, provides faithful explanations and outperforms similar aggregation methods.
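Purely to illustrate the layer-wise aggregation idea, and not ALTI's actual distance-based metric, the sketch below builds a crude per-layer token-to-token contribution matrix from attention weights and value norms, credits the residual connection to the token itself, and multiplies the matrices across layers to obtain input attribution scores. The tensor shapes and the contribution measure are assumptions.

```python
# Hedged sketch of layer-wise aggregation; the per-layer contribution measure
# (attention-weighted value norms plus residual) is only a stand-in for
# ALTI's metric, which also accounts for layer normalization.
import torch

def layer_contributions(attn, values, residual_norms):
    """attn: (seq, seq) head-averaged attention of one layer.
    values: (seq, dim) value vectors; residual_norms: (seq,) norms of the
    residual stream entering the block. Returns a row-stochastic (seq, seq)
    matrix of token-to-token contributions."""
    weighted = attn.unsqueeze(-1) * values.unsqueeze(0)   # (seq, seq, dim)
    c = weighted.norm(dim=-1)                             # contribution of j to i
    c = c + torch.diag(residual_norms)                     # residual -> same token
    return c / c.sum(dim=-1, keepdim=True)

def aggregate_layers(per_layer):
    """per_layer: contribution matrices in layer order (first layer first).
    Row i of the result attributes token i's final representation to inputs."""
    agg = per_layer[0]
    for c in per_layer[1:]:
        agg = c @ agg
    return agg
```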
arXiv Detail & Related papers (2022-03-08T17:21:27Z) - Boosting Crowd Counting via Multifaceted Attention [109.89185492364386]
Large-scale variations often exist within crowd images.
Neither the fixed-size convolution kernels of CNNs nor the fixed-size attention of recent vision transformers can handle this kind of variation.
We propose a Multifaceted Attention Network (MAN) to improve transformer models in local spatial relation encoding.
arXiv Detail & Related papers (2022-03-05T01:36:43Z) - Locally Shifted Attention With Early Global Integration [93.5766619842226]
We propose an approach that allows for coarse global interactions and fine-grained local interactions already at early layers of a vision transformer.
Our method is shown to be superior to both convolutional and transformer-based methods for image classification on CIFAR10, CIFAR100, and ImageNet.
arXiv Detail & Related papers (2021-12-09T18:12:24Z) - Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any model with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
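As a toy illustration of the matching idea, the snippet below penalizes the difference between the per-head means and variances of the query and key projections. The paper's actual alignment objective is a more sophisticated distribution-matching loss; the tensor layout and the weighting term are assumptions.

```python
# Toy sketch: nudge the per-head key and query distributions toward each
# other with a simple mean/variance matching penalty, added to the task loss.
import torch

def alignment_penalty(q, k):
    """q, k: (batch, num_heads, seq_len, head_dim) query/key projections."""
    q_mean, k_mean = q.mean(dim=(0, 2)), k.mean(dim=(0, 2))   # (heads, dim)
    q_var, k_var = q.var(dim=(0, 2)), k.var(dim=(0, 2))
    return ((q_mean - k_mean) ** 2).mean() + ((q_var - k_var) ** 2).mean()

# Hypothetical usage inside a training step (lambda_align is an assumed weight):
# loss = task_loss + lambda_align * sum(alignment_penalty(q, k)
#                                       for q, k in per_layer_projections)
```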
arXiv Detail & Related papers (2021-10-25T00:54:57Z) - Is Sparse Attention more Interpretable? [52.85910570651047]
We investigate how sparsity affects our ability to use attention as an explainability tool.
We find that only a weak relationship exists between inputs and co-indexed intermediate representations under sparse attention.
We observe in this setting that inducing sparsity may make it less plausible that attention can be used as a tool for understanding model behavior.
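The sparse attention examined in work of this kind typically replaces softmax with a sparse normalizer such as sparsemax or entmax. Below is a small, self-contained sparsemax (Martins & Astudillo, 2016) that could be dropped into an attention layer in place of softmax, offered only as a sketch of the setting studied.

```python
# Sketch of sparsemax, a sparse replacement for softmax over attention scores.
import torch

def sparsemax(scores, dim=-1):
    z, _ = torch.sort(scores, dim=dim, descending=True)
    cumsum = z.cumsum(dim)
    k = torch.arange(1, scores.size(dim) + 1,
                     device=scores.device, dtype=scores.dtype)
    shape = [1] * scores.dim()
    shape[dim] = -1
    k = k.view(shape)
    support = 1 + k * z > cumsum                      # entries that stay nonzero
    k_z = support.sum(dim=dim, keepdim=True)          # support size per row
    tau = (cumsum.gather(dim, k_z - 1) - 1) / k_z.to(scores.dtype)
    return torch.clamp(scores - tau, min=0.0)

# e.g. attn = sparsemax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
```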
arXiv Detail & Related papers (2021-06-02T11:42:56Z) - SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
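The snippet below is only a guess at the general shape of a differentiable attention mask: a learnable logit per position pair, relaxed with a sigmoid and encouraged to be sparse. The actual DAM algorithm in SparseBERT may parameterize and train the mask differently.

```python
# Illustrative sketch only: one way to make an attention mask differentiable.
# A learnable logit per (query, key) pair is relaxed with a sigmoid and
# multiplied into the attention probabilities; a mean-magnitude term pushes
# the mask toward sparsity. The DAM algorithm may differ in the details.
import torch
import torch.nn as nn

class SoftAttentionMask(nn.Module):
    def __init__(self, max_len):
        super().__init__()
        self.mask_logits = nn.Parameter(torch.zeros(max_len, max_len))

    def forward(self, attn_probs):
        # attn_probs: (batch, num_heads, seq_len, seq_len)
        n = attn_probs.size(-1)
        soft_mask = torch.sigmoid(self.mask_logits[:n, :n])
        masked = attn_probs * soft_mask
        masked = masked / masked.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        sparsity_penalty = soft_mask.mean()    # add to the loss to prune
        return masked, sparsity_penalty
```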
arXiv Detail & Related papers (2021-02-25T14:13:44Z) - On the Importance of Local Information in Transformer Based Models [19.036044858449593]
The self-attention module is a key component of Transformer-based models.
Recent studies have shown that these heads exhibit syntactic, semantic, or local behaviour.
We show that a larger fraction of heads have a locality bias as compared to a syntactic bias.
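One simple way to quantify the locality bias discussed here is to measure how much of each head's attention mass falls inside a small window around the query position. The sketch below is such a proxy, not the paper's exact measure, and the window size is arbitrary.

```python
# Simple locality proxy: the fraction of each head's attention mass that
# stays within +/- `window` positions of the query token.
import torch

def locality_score(attn, window=2):
    """attn: (num_heads, seq_len, seq_len) attention weights of one layer."""
    n = attn.size(-1)
    pos = torch.arange(n)
    local = (pos[None, :] - pos[:, None]).abs() <= window   # (seq, seq) bool
    return (attn * local).sum(dim=(-2, -1)) / attn.sum(dim=(-2, -1))
```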
arXiv Detail & Related papers (2020-08-13T11:32:47Z) - Quantifying Attention Flow in Transformers [12.197250533100283]
"self-attention" combines information from attended embeddings into the representation of the focal embedding in the next layer.
This makes attention weights unreliable as explanations probes.
We propose two methods for approximating the attention to input tokens given attention weights, attention rollout and attention flow.
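Of the two methods, attention rollout is simple enough to sketch in a few lines: average the heads, add the residual connection, re-normalize, and multiply the per-layer matrices. The version below assumes per-layer attention tensors such as those returned by a Hugging Face model called with output_attentions=True; attention flow (a max-flow formulation) is not shown.

```python
# Sketch of attention rollout; `attentions` is assumed to be a list of
# per-layer tensors of shape (num_heads, seq_len, seq_len).
import torch

def attention_rollout(attentions):
    rollout = None
    for layer_attn in attentions:
        a = layer_attn.mean(dim=0)                   # average over heads
        a = 0.5 * a + 0.5 * torch.eye(a.size(-1))    # account for the residual
        a = a / a.sum(dim=-1, keepdim=True)          # keep rows stochastic
        rollout = a if rollout is None else a @ rollout
    return rollout   # (seq_len, seq_len): attention of final tokens to inputs
```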
arXiv Detail & Related papers (2020-05-02T21:45:27Z) - Self-Attention Attribution: Interpreting Information Interactions Inside
Transformer [89.21584915290319]
We propose a self-attention attribution method to interpret the information interactions inside Transformer.
We show that the attribution results can be used as adversarial patterns to implement non-targeted attacks towards BERT.
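The core of the method is integrated gradients computed over the attention matrix itself rather than over the input. The sketch below shows only that accumulation step; `forward_with_attention` is a hypothetical callable, not part of any library, that must re-run the model with the supplied attention substituted into one layer and return a scalar such as the target-class logit.

```python
# Sketch of integrated gradients over an attention matrix (the core of
# self-attention attribution). `forward_with_attention` is a hypothetical,
# user-supplied hook-based callable.
import torch

def attention_attribution(attn, forward_with_attention, steps=20):
    """attn: (num_heads, seq_len, seq_len) attention of one layer (detached)."""
    total_grad = torch.zeros_like(attn)
    for step in range(1, steps + 1):
        scaled = (step / steps * attn).detach().requires_grad_(True)
        out = forward_with_attention(scaled)
        grad, = torch.autograd.grad(out, scaled)
        total_grad += grad
    return attn * total_grad / steps    # elementwise, per the IG formula
```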
arXiv Detail & Related papers (2020-04-23T14:58:22Z)