SparseBERT: Rethinking the Importance Analysis in Self-attention
- URL: http://arxiv.org/abs/2102.12871v1
- Date: Thu, 25 Feb 2021 14:13:44 GMT
- Title: SparseBERT: Rethinking the Importance Analysis in Self-attention
- Authors: Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li,
James T. Kwok
- Abstract summary: Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
- Score: 107.68072039537311
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models are popular for natural language processing (NLP)
tasks due to their powerful capacity. As the core component, the self-attention
module has aroused widespread interest. Attention map visualization of a
pre-trained model is one direct way to understand the self-attention mechanism,
and some common patterns can be observed in such visualizations. Based on these
patterns, a series of efficient transformers have been proposed with
corresponding sparse attention masks. Beyond these empirical results, the
universal approximability of Transformer-based models has also been established
from a theoretical perspective. However, this understanding and analysis of
self-attention is based on a pre-trained model. To rethink the importance
analysis in self-attention, we delve into the dynamics of attention matrix
importance during pre-training. One surprising result is that the diagonal
elements of the attention map are the least important compared with other
attention positions, and we provide a proof showing that these elements can be
removed without damaging model performance. Furthermore, we propose a
Differentiable Attention Mask (DAM) algorithm, which can also be applied to
guide the design of SparseBERT. Extensive experiments verify our interesting
findings and illustrate the effect of the proposed algorithm.
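The abstract describes two ideas that lend themselves to a short illustration: masking out the diagonal (self-to-self) attention positions, and learning which attention positions to keep via a differentiable mask. The sketch below is a minimal, assumed reading of those ideas, not the authors' implementation: the names `masked_self_attention`, `no_diagonal_mask`, and `DifferentiableAttentionMask`, the tensor shapes, and the sigmoid-with-temperature relaxation and mean-density penalty are all illustrative placeholders for whatever parameterization and sparsity objective DAM actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_self_attention(q, k, v, mask, eps=1e-9):
    """Scaled dot-product attention with a hard or soft attention mask.

    q, k, v: (batch, heads, seq_len, head_dim)
    mask:    (seq_len, seq_len), values in [0, 1]; 0 removes a position pair.
    """
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    # log(0 + eps) is a large negative number, so hard-masked pairs get ~0
    # probability; a soft mask in (0, 1) stays differentiable.
    scores = scores + torch.log(mask + eps)
    return F.softmax(scores, dim=-1) @ v

def no_diagonal_mask(seq_len):
    # Fixed sparse mask reflecting the paper's finding: diagonal (self-to-self)
    # attention positions are the least important and can be dropped.
    return 1.0 - torch.eye(seq_len)

class DifferentiableAttentionMask(nn.Module):
    """Minimal relaxation of a binary attention mask: one learnable logit per
    position pair, squashed by a sigmoid so the mask can be trained jointly
    with the model and binarized or sparsified afterwards."""

    def __init__(self, seq_len, temperature=1.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(seq_len, seq_len))
        self.temperature = temperature

    def forward(self):
        return torch.sigmoid(self.logits / self.temperature)

    def sparsity_penalty(self):
        # Simple density regularizer; an assumption standing in for whatever
        # sparsity objective the paper actually optimizes.
        return self.forward().mean()

# Usage: drop diagonal attention with a fixed mask, or learn which positions to keep.
q = k = v = torch.randn(2, 12, 16, 64)  # (batch, heads, seq_len, head_dim)
out_fixed = masked_self_attention(q, k, v, no_diagonal_mask(16))
dam = DifferentiableAttentionMask(seq_len=16)
out_learned = masked_self_attention(q, k, v, dam())
```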
Related papers
- On Explaining with Attention Matrices [2.1178416840822027]
This paper explores the possible explanatory link between attention weights (AW) in transformer models and predicted output.
We introduce and effectively compute efficient attention, which isolates the effective components of attention matrices in tasks and models in which AW play an explanatory role.
arXiv Detail & Related papers (2024-10-24T08:43:33Z) - DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z) - Noise-Free Explanation for Driving Action Prediction [11.330363757618379]
We propose an easy-to-implement but effective way to remedy this flaw: Smooth Noise Norm Attention (SNNA).
We weigh the attention by the norm of the transformed value vector and guide the label-specific signal with the attention gradient, then randomly sample the input perturbations and average the corresponding gradients to produce noise-free attribution.
Both qualitative and quantitative evaluation results show the superiority of SNNA compared to other SOTA attention-based explainable methods in generating a clearer visual explanation map and ranking the input pixel importance.
arXiv Detail & Related papers (2024-07-08T19:21:24Z) - Unveiling and Controlling Anomalous Attention Distribution in Transformers [8.456319173083315]
The waiver phenomenon allows elements to absorb excess attention without affecting their contribution to information.
We find that, depending on differences in positional encoding and attention patterns across specific models, the selection of waiver elements can be categorized into two methods.
arXiv Detail & Related papers (2024-06-26T11:53:35Z) - Naturalness of Attention: Revisiting Attention in Code Language Models [3.756550107432323]
Language models for code such as CodeBERT offer the capability to learn advanced source code representations, but their opacity poses barriers to understanding the properties they capture.
This study aims to shed some light on the previously ignored factors of the attention mechanism beyond the attention weights.
arXiv Detail & Related papers (2023-11-22T16:34:12Z) - How Much Does Attention Actually Attend? Questioning the Importance of
Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success (a rough sketch of this constant-attention probe appears after this list).
arXiv Detail & Related papers (2022-11-07T12:37:54Z) - Exploring Target Representations for Masked Autoencoders [78.57196600585462]
We show that a careful choice of the target representation is unnecessary for learning good representations.
We propose a multi-stage masked distillation pipeline and use a randomly initialized model as the teacher.
The proposed method, which performs masked knowledge distillation with bootstrapped teachers (dBOT), outperforms previous self-supervised methods by nontrivial margins.
arXiv Detail & Related papers (2022-09-08T16:55:19Z) - Bayesian Attention Belief Networks [59.183311769616466]
Attention-based neural networks have achieved state-of-the-art results on a wide range of tasks.
This paper introduces Bayesian attention belief networks, which construct a decoder network by modeling unnormalized attention weights.
We show that our method outperforms deterministic attention and state-of-the-art attention in accuracy, uncertainty estimation, generalization across domains, and adversarial attacks.
arXiv Detail & Related papers (2021-06-09T17:46:22Z) - Effective Attention Sheds Light On Interpretability [3.317258557707008]
We ask whether visualizing effective attention gives different conclusions than interpretation of standard attention.
We show that effective attention is less associated with the features related to the language modeling pretraining.
We recommend using effective attention for studying a transformer's behavior since it is more pertinent to the model output by design.
arXiv Detail & Related papers (2021-05-18T23:41:26Z) - Input-independent Attention Weights Are Expressive Enough: A Study of
Attention in Self-supervised Audio Transformers [55.40032342541187]
We pre-train transformer-based models with different attention algorithms in a self-supervised fashion and treat them as feature extractors on downstream tasks.
Our approach shows comparable performance to the typical self-attention yet requires 20% less time in both training and inference.
arXiv Detail & Related papers (2020-06-09T10:40:52Z)