The heads hypothesis: A unifying statistical approach towards
understanding multi-headed attention in BERT
- URL: http://arxiv.org/abs/2101.09115v1
- Date: Fri, 22 Jan 2021 14:10:59 GMT
- Title: The heads hypothesis: A unifying statistical approach towards
understanding multi-headed attention in BERT
- Authors: Madhura Pande, Aakriti Budhraja, Preksha Nema, Pratyush Kumar and
Mitesh M. Khapra
- Abstract summary: Multi-headed attention is a mainstay of transformer-based models.
Different methods have been proposed to classify the role of each attention head based on the relations between tokens that have high pairwise attention.
We formalize a simple yet effective score that generalizes to all the roles of attention heads, and we employ hypothesis testing on this score for robust inference.
- Score: 18.13834903235249
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-headed attention is a mainstay of transformer-based models.
Different methods have been proposed to classify the role of each attention
head based on the relations between tokens which have high pair-wise attention.
These roles include syntactic (tokens with some syntactic relation), local
(nearby tokens), block (tokens in the same sentence) and delimiter (the special
[CLS], [SEP] tokens). There are two main challenges with existing methods for
classification: (a) there are no standard scores across studies or across
functional roles, and (b) these scores are often average quantities measured
across sentences without capturing statistical significance. In this work, we
formalize a simple yet effective score that generalizes to all the roles of
attention heads, and we employ hypothesis testing on this score for robust
inference. This provides us with the right lens to systematically analyze
attention heads and confidently answer many commonly posed questions about the
BERT model. In particular, we comment on the co-location of multiple functional
roles in the same attention head, the distribution of attention heads across
layers, and the effect of fine-tuning for specific NLP tasks on these functional
roles.
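The score-plus-hypothesis-test recipe described above can be sketched in a few lines. This is a minimal illustration rather than the paper's actual formulation: the function names, the "local" role predicate, and the one-sample t-statistic against a uniform-attention baseline are all assumptions made for the example.

```python
import math
from statistics import mean, stdev

def role_score(attn_row, is_role_pair, q):
    """Fraction of query q's attention mass landing on tokens that
    satisfy the role predicate (illustrative score, not the paper's)."""
    total = sum(attn_row)
    hit = sum(a for k, a in enumerate(attn_row) if is_role_pair(q, k))
    return hit / total if total > 0 else 0.0

def head_score(attn, is_role_pair):
    """Average role score over all query positions of one sentence."""
    return mean(role_score(row, is_role_pair, q) for q, row in enumerate(attn))

def t_statistic(scores, chance):
    """One-sample t-statistic: do per-sentence head scores exceed the
    level expected by chance under uniform attention?"""
    n = len(scores)
    s = stdev(scores)
    if s == 0.0:
        return math.inf if mean(scores) > chance else 0.0
    return (mean(scores) - chance) / (s / math.sqrt(n))

# Example: a "local" role -- attending to immediate neighbours.
is_local = lambda q, k: 0 < abs(q - k) <= 1

attn = [[0.1, 0.8, 0.1],   # attn[q][k]: attention from query q to key k
        [0.4, 0.2, 0.4],
        [0.1, 0.8, 0.1]]
score = head_score(attn, is_local)           # 0.8: head looks strongly "local"
chance = 4 / 9                               # uniform-attention baseline for n=3
t = t_statistic([0.80, 0.75, 0.85], chance)  # large t => reject "no local role"
```

A head would then be labeled with a role only when the statistic clears a critical value for the chosen significance level, which is what makes the classification robust rather than an unqualified average.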
Related papers
- Disentangling Interactions and Dependencies in Feature Attribution [9.442326245744916]
In machine learning, global feature importance methods try to determine how much each individual feature contributes to predicting a target variable.
Features can also contribute cooperatively, but in commonly used feature importance scores these cooperative effects are conflated with the features' individual contributions.
We derive DIP, a new mathematical decomposition of individual feature importance scores that disentangles three components.
arXiv Detail & Related papers (2024-10-31T09:41:10Z)
- An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models [64.87562101662952]
We show that input tokens are often exchangeable since they already include positional encodings.
We establish the existence of a sufficient and minimal representation of input tokens.
We prove that attention with the desired parameter infers the latent posterior up to an approximation error.
arXiv Detail & Related papers (2022-12-30T17:59:01Z)
- Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition [36.53453860656191]
We investigate approaches to increasing attention head diversity.
We show that introducing diversity-promoting auxiliary loss functions during training is a more effective approach.
Finally, we draw a connection between the diversity of attention heads and the similarity of the gradients of head parameters.
arXiv Detail & Related papers (2022-09-13T15:50:03Z)
- A Song of (Dis)agreement: Evaluating the Evaluation of Explainable Artificial Intelligence in Natural Language Processing [7.527234046228323]
We argue that the community should stop using rank correlation as an evaluation metric for attention-based explanations.
We find that attention-based explanations do not correlate strongly with any recent feature attribution methods.
arXiv Detail & Related papers (2022-05-09T21:07:39Z)
- Compositional Attention: Disentangling Search and Retrieval [66.7108739597771]
Multi-head, key-value attention is the backbone of the Transformer model and its variants.
Standard attention heads learn a rigid mapping between search and retrieval.
We propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure.
arXiv Detail & Related papers (2021-10-18T15:47:38Z)
- ACP++: Action Co-occurrence Priors for Human-Object Interaction Detection [102.9428507180728]
A common problem in the task of human-object interaction (HOI) detection is that numerous HOI classes have only a small number of labeled examples.
We observe that there exist natural correlations and anti-correlations among human-object interactions.
We present techniques to learn these priors and leverage them for more effective training, especially on rare classes.
arXiv Detail & Related papers (2021-09-09T06:02:50Z)
- Nested Counterfactual Identification from Arbitrary Surrogate Experiments [95.48089725859298]
We study the identification of nested counterfactuals from an arbitrary combination of observations and experiments.
Specifically, we prove the counterfactual unnesting theorem (CUT), which allows one to map arbitrary nested counterfactuals to unnested ones.
arXiv Detail & Related papers (2021-07-07T12:51:04Z)
- Learning with Instance Bundles for Reading Comprehension [61.823444215188296]
We introduce new supervision techniques that compare question-answer scores across multiple related instances.
Specifically, we normalize these scores across various neighborhoods of closely contrasting questions and/or answers.
We empirically demonstrate the effectiveness of training with instance bundles on two datasets.
arXiv Detail & Related papers (2021-04-18T06:17:54Z)
- Multi-Head Self-Attention with Role-Guided Masks [20.955992710112216]
We propose a method to guide the attention heads towards roles identified in prior work as important.
We do this by defining role-specific masks to constrain the heads to attend to specific parts of the input.
Experiments on text classification and machine translation using 7 different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines.
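The role-specific masking idea above can be illustrated with a minimal sketch: build a binary mask for an assumed "local" role and apply it before the softmax so the head can only attend within a window. The function names and the window-based mask are assumptions for illustration, not the authors' implementation.

```python
import math

def local_role_mask(n, window=1):
    """Binary mask for a hypothetical 'local' role: query q may only
    attend to keys within `window` positions (self included)."""
    return [[1 if abs(q - k) <= window else 0 for k in range(n)]
            for q in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax with masked-out positions forced to zero weight."""
    out = []
    for row_s, row_m in zip(scores, mask):
        logits = [s if m else -math.inf for s, m in zip(row_s, row_m)]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]  # exp(-inf) == 0.0
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

# Uniform raw scores: the mask alone decides where attention can go.
weights = masked_softmax([[0.0] * 4 for _ in range(4)], local_role_mask(4))
# weights[0] == [0.5, 0.5, 0.0, 0.0]: query 0 attends only to itself and key 1
```

Multiplying a head's pre-softmax scores by such a mask (or adding -inf at disallowed positions, as here) constrains that head to its assigned role during training.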
arXiv Detail & Related papers (2020-12-22T21:34:02Z)
- On the Importance of Local Information in Transformer Based Models [19.036044858449593]
The self-attention module is a key component of Transformer-based models.
Recent studies have shown that these heads exhibit syntactic, semantic, or local behaviour.
We show that a larger fraction of heads have a locality bias as compared to a syntactic bias.
arXiv Detail & Related papers (2020-08-13T11:32:47Z)
- Detecting Human-Object Interactions with Action Co-occurrence Priors [108.31956827512376]
A common problem in the human-object interaction (HOI) detection task is that numerous HOI classes have only a small number of labeled examples.
We observe that there exist natural correlations and anti-correlations among human-object interactions.
We present techniques to learn these priors and leverage them for more effective training, especially in rare classes.
arXiv Detail & Related papers (2020-07-17T02:47:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.