How Much Does Attention Actually Attend? Questioning the Importance of
Attention in Pretrained Transformers
- URL: http://arxiv.org/abs/2211.03495v1
- Date: Mon, 7 Nov 2022 12:37:54 GMT
- Title: How Much Does Attention Actually Attend? Questioning the Importance of
Attention in Pretrained Transformers
- Authors: Michael Hassid, Hao Peng, Daniel Rotem, Jungo Kasai, Ivan Montero,
Noah A. Smith and Roy Schwartz
- Abstract summary: We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
- Score: 59.57128476584361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The attention mechanism is considered the backbone of the widely-used
Transformer architecture. It contextualizes the input by computing
input-specific attention matrices. We find that this mechanism, while powerful
and elegant, is not as important as typically thought for pretrained language
models. We introduce PAPA, a new probing method that replaces the
input-dependent attention matrices with constant ones -- the average attention
weights over multiple inputs. We use PAPA to analyze several established
pretrained Transformers on six downstream tasks. We find that without any
input-dependent attention, all models achieve competitive performance -- an
average relative drop of only 8% from the probing baseline. Further, little or
no performance drop is observed when replacing half of the input-dependent
attention matrices with constant (input-independent) ones. Interestingly, we
show that better-performing models lose more from applying our method than
weaker models, suggesting that the utilization of the input-dependent attention
mechanism might be a factor in their success. Our results motivate research on
simpler alternatives to input-dependent attention, as well as on methods for
better utilization of this mechanism in the Transformer architecture.
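To make the probing method concrete, below is a minimal PyTorch-style sketch that replaces one attention head's input-dependent attention matrix with a constant matrix obtained by averaging that head's attention weights over a probe set, in the spirit of PAPA. The class and helper names, the cropping to the current sequence length, and the row re-normalization are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class ConstantAttentionHead(nn.Module):
    """Applies a fixed (input-independent) attention pattern to the values.

    `avg_attn` is assumed to hold the head's attention weights averaged over
    many probing inputs (shape: max_len x max_len). A hedged sketch of the
    PAPA idea, not the paper's code.
    """

    def __init__(self, avg_attn: torch.Tensor):
        super().__init__()
        self.register_buffer("avg_attn", avg_attn)  # frozen constant matrix

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: (batch, seq_len, d_head); queries and keys are ignored,
        # so the same attention pattern is applied to every input.
        seq_len = values.size(1)
        attn = self.avg_attn[:seq_len, :seq_len]
        attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalize rows after cropping
        return attn @ values  # (batch, seq_len, d_head)


def average_attention(per_example_attns, max_len):
    """Zero-pad per-example attention matrices to max_len and average them."""
    padded = torch.zeros(len(per_example_attns), max_len, max_len)
    for i, a in enumerate(per_example_attns):
        n = a.size(0)
        padded[i, :n, :n] = a
    return padded.mean(dim=0)
```

In this setup only some heads need to be swapped: the abstract notes that replacing half of the input-dependent attention matrices with constant ones causes little or no performance drop.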
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- DAE-Former: Dual Attention-guided Efficient Transformer for Medical Image Segmentation [3.9548535445908928]
We propose DAE-Former, a novel method that seeks to provide an alternative perspective by efficiently designing the self-attention mechanism.
Our method outperforms state-of-the-art methods on multi-organ cardiac and skin lesion segmentation datasets without requiring pre-training weights.
arXiv Detail & Related papers (2022-12-27T14:39:39Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Centroid Transformers: Learning to Abstract with Attention [15.506293166377182]
Self-attention is a powerful mechanism for extracting features from the inputs.
We propose centroid attention, a generalization of self-attention that maps N inputs to M outputs $(M \leq N)$; a rough sketch of this input-reduction idea appears after this list.
We apply our method to various applications, including abstractive text summarization, 3D vision, and image processing.
arXiv Detail & Related papers (2021-02-17T07:04:19Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
- Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers [55.40032342541187]
We pre-train transformer-based models with different attention algorithms in a self-supervised fashion and treat them as feature extractors on downstream tasks.
Our approach shows comparable performance to the typical self-attention yet requires 20% less time in both training and inference.
arXiv Detail & Related papers (2020-06-09T10:40:52Z)
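Rough sketch for the Centroid Transformers entry above: M learned centroid queries attend over the N inputs to produce M outputs. This is a hedged, cross-attention-style illustration of the input-reduction idea only; all names and details are assumptions, not the paper's exact centroid update rule.

```python
import torch
import torch.nn as nn


class CentroidAttentionSketch(nn.Module):
    """Maps N input vectors to M output vectors (M <= N) via attention.

    M learned centroid queries attend over the N inputs. Illustrative only;
    not the exact formulation of the Centroid Transformers paper.
    """

    def __init__(self, d_model: int, num_centroids: int):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_centroids, d_model))
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_model) -> (batch, M, d_model)
        keys, values = self.key(x), self.value(x)
        scores = self.centroids @ keys.transpose(-2, -1)         # (batch, M, N)
        weights = torch.softmax(scores / x.size(-1) ** 0.5, dim=-1)
        return weights @ values                                  # (batch, M, d_model)
```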