The heads hypothesis: A unifying statistical approach towards
understanding multi-headed attention in BERT
- URL: http://arxiv.org/abs/2101.09115v1
- Date: Fri, 22 Jan 2021 14:10:59 GMT
- Title: The heads hypothesis: A unifying statistical approach towards
understanding multi-headed attention in BERT
- Authors: Madhura Pande, Aakriti Budhraja, Preksha Nema, Pratyush Kumar and
Mitesh M. Khapra
- Abstract summary: Multi-headed attention is a mainstay of transformer-based models.
Different methods have been proposed to classify the role of each attention head based on the relations between tokens that have high pairwise attention.
We formalize a simple yet effective score that generalizes to all the roles of attention heads, and we employ hypothesis testing on this score for robust inference.
- Score: 18.13834903235249
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-headed attention is a mainstay of transformer-based models.
Different methods have been proposed to classify the role of each attention
head based on the relations between tokens which have high pair-wise attention.
These roles include syntactic (tokens with some syntactic relation), local
(nearby tokens), block (tokens in the same sentence) and delimiter (the special
[CLS], [SEP] tokens). There are two main challenges with existing methods for
classification: (a) there are no standard scores across studies or across
functional roles, and (b) these scores are often average quantities measured
across sentences without capturing statistical significance. In this work, we
formalize a simple yet effective score that generalizes to all the roles of
attention heads, and we employ hypothesis testing on this score for robust
inference. This provides us with the right lens to systematically analyze
attention heads and confidently answer many commonly posed questions about the
BERT model. In particular, we comment on the co-location of multiple functional
roles in the same attention head, the distribution of attention heads across
layers, and the effect of fine-tuning for specific NLP tasks on these functional
roles.
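The score-plus-hypothesis-test recipe described above can be sketched in a few lines. This is a minimal illustration rather than the paper's actual formulation: the function names, the "local" role predicate, and the one-sample t-statistic against a uniform-attention baseline are all assumptions made for the example.

```python
import math
from statistics import mean, stdev

def role_score(attn_row, is_role_pair, q):
    """Fraction of query q's attention mass landing on tokens that
    satisfy the role predicate (illustrative score, not the paper's)."""
    total = sum(attn_row)
    hit = sum(a for k, a in enumerate(attn_row) if is_role_pair(q, k))
    return hit / total if total > 0 else 0.0

def head_score(attn, is_role_pair):
    """Average role score over all query positions of one sentence."""
    return mean(role_score(row, is_role_pair, q) for q, row in enumerate(attn))

def t_statistic(scores, chance):
    """One-sample t-statistic: do per-sentence head scores exceed the
    level expected by chance under uniform attention?"""
    n = len(scores)
    s = stdev(scores)
    if s == 0.0:
        return math.inf if mean(scores) > chance else 0.0
    return (mean(scores) - chance) / (s / math.sqrt(n))

# Example: a "local" role -- attending to immediate neighbours.
is_local = lambda q, k: 0 < abs(q - k) <= 1

attn = [[0.1, 0.8, 0.1],   # attn[q][k]: attention from query q to key k
        [0.4, 0.2, 0.4],
        [0.1, 0.8, 0.1]]
score = head_score(attn, is_local)           # 0.8: head looks strongly "local"
chance = 4 / 9                               # uniform-attention baseline for n=3
t = t_statistic([0.80, 0.75, 0.85], chance)  # large t => reject "no local role"
```

A head would then be labeled with a role only when the statistic clears a critical value for the chosen significance level, which is what makes the classification robust rather than an unqualified average.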
Related papers
- Disentangling Interactions and Dependencies in Feature Attribution [9.442326245744916]
In machine learning, global feature importance methods try to determine how much each individual feature contributes to predicting a target variable.
Features can also contribute cooperatively, but in commonly used feature importance scores these cooperative effects are conflated with the features' individual contributions.
We derive DIP, a new mathematical decomposition of individual feature importance scores that disentangles three components.
arXiv Detail & Related papers (2024-10-31T09:41:10Z)
- An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models [64.87562101662952]
We show that input tokens are often exchangeable since they already include positional encodings.
We establish the existence of a sufficient and minimal representation of input tokens.
We prove that attention with the desired parameter infers the latent posterior up to an approximation error.
arXiv Detail & Related papers (2022-12-30T17:59:01Z)
- Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition [36.53453860656191]
We investigate approaches to increasing attention head diversity.
We show that introducing diversity-promoting auxiliary loss functions during training is a more effective approach.
Finally, we draw a connection between the diversity of attention heads and the similarity of the gradients of head parameters.
arXiv Detail & Related papers (2022-09-13T15:50:03Z)
- A Song of (Dis)agreement: Evaluating the Evaluation of Explainable Artificial Intelligence in Natural Language Processing [7.527234046228323]
We argue that the community should stop using rank correlation as an evaluation metric for attention-based explanations.
We find that attention-based explanations do not correlate strongly with any recent feature attribution methods.
arXiv Detail & Related papers (2022-05-09T21:07:39Z)
- Compositional Attention: Disentangling Search and Retrieval [66.7108739597771]
Multi-head, key-value attention is the backbone of the Transformer model and its variants.
Standard attention heads learn a rigid mapping between search and retrieval.
We propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure.
arXiv Detail & Related papers (2021-10-18T15:47:38Z)
- ACP++: Action Co-occurrence Priors for Human-Object Interaction Detection [102.9428507180728]
A common problem in the task of human-object interaction (HOI) detection is that numerous HOI classes have only a small number of labeled examples.
We observe that there exist natural correlations and anti-correlations among human-object interactions.
We present techniques to learn these priors and leverage them for more effective training, especially on rare classes.
arXiv Detail & Related papers (2021-09-09T06:02:50Z)
- Nested Counterfactual Identification from Arbitrary Surrogate Experiments [95.48089725859298]
We study the identification of nested counterfactuals from an arbitrary combination of observations and experiments.
Specifically, we prove the counterfactual unnesting theorem (CUT), which allows one to map arbitrary nested counterfactuals to unnested ones.
arXiv Detail & Related papers (2021-07-07T12:51:04Z)
- Learning with Instance Bundles for Reading Comprehension [61.823444215188296]
We introduce new supervision techniques that compare question-answer scores across multiple related instances.
Specifically, we normalize these scores across various neighborhoods of closely contrasting questions and/or answers.
We empirically demonstrate the effectiveness of training with instance bundles on two datasets.
arXiv Detail & Related papers (2021-04-18T06:17:54Z)
- Multi-Head Self-Attention with Role-Guided Masks [20.955992710112216]
We propose a method to guide the attention heads towards roles identified in prior work as important.
We do this by defining role-specific masks to constrain the heads to attend to specific parts of the input.
Experiments on text classification and machine translation using 7 different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines.
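The role-specific masking idea above can be illustrated with a minimal sketch: build a binary mask for an assumed "local" role and apply it before the softmax so the head can only attend within a window. The function names and the window-based mask are assumptions for illustration, not the authors' implementation.

```python
import math

def local_role_mask(n, window=1):
    """Binary mask for a hypothetical 'local' role: query q may only
    attend to keys within `window` positions (self included)."""
    return [[1 if abs(q - k) <= window else 0 for k in range(n)]
            for q in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax with masked-out positions forced to zero weight."""
    out = []
    for row_s, row_m in zip(scores, mask):
        logits = [s if m else -math.inf for s, m in zip(row_s, row_m)]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]  # exp(-inf) == 0.0
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

# Uniform raw scores: the mask alone decides where attention can go.
weights = masked_softmax([[0.0] * 4 for _ in range(4)], local_role_mask(4))
# weights[0] == [0.5, 0.5, 0.0, 0.0]: query 0 attends only to itself and key 1
```

Multiplying a head's pre-softmax scores by such a mask (or adding -inf at disallowed positions, as here) constrains that head to its assigned role during training.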
arXiv Detail & Related papers (2020-12-22T21:34:02Z)
- On the Importance of Local Information in Transformer Based Models [19.036044858449593]
The self-attention module is a key component of Transformer-based models.
Recent studies have shown that these heads exhibit syntactic, semantic, or local behaviour.
We show that a larger fraction of heads have a locality bias as compared to a syntactic bias.
arXiv Detail & Related papers (2020-08-13T11:32:47Z)
- Detecting Human-Object Interactions with Action Co-occurrence Priors [108.31956827512376]
A common problem in the human-object interaction (HOI) detection task is that numerous HOI classes have only a small number of labeled examples.
We observe that there exist natural correlations and anti-correlations among human-object interactions.
We present techniques to learn these priors and leverage them for more effective training, especially in rare classes.
arXiv Detail & Related papers (2020-07-17T02:47:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.