Generic Attention-model Explainability by Weighted Relevance
Accumulation
- URL: http://arxiv.org/abs/2308.10240v1
- Date: Sun, 20 Aug 2023 12:02:30 GMT
- Title: Generic Attention-model Explainability by Weighted Relevance Accumulation
- Authors: Yiming Huang, Aozhe Jia, Xiaodan Zhang, Jiawei Zhang
- Abstract summary: We propose a weighted relevancy strategy, which takes the importance of token values into consideration, to reduce the distortion introduced when relevance is accumulated equally.
To evaluate our method, we propose a unified CLIP-based two-stage model, named CLIPmapper, to process Vision-and-Language tasks.
- Score: 9.816810016935541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based transformer models have achieved remarkable progress in
multi-modal tasks, such as visual question answering. The explainability of
attention-based methods has recently attracted wide interest, since such methods can
explain the internal changes of attention tokens by accumulating relevance across
attention layers. Current methods simply update relevance by equally accumulating
the token relevance before and after each attention process. However, the importance
of token values usually differs during relevance accumulation. In this paper, we
propose a weighted relevancy strategy, which takes the importance of token values
into consideration, to reduce the distortion introduced when relevance is
accumulated equally. To evaluate our method, we propose a unified CLIP-based
two-stage model, named CLIPmapper, which processes Vision-and-Language tasks
through a CLIP encoder followed by a mapper. CLIPmapper combines self-attention,
cross-attention, single-modality, and cross-modality attention, making it well
suited for evaluating our generic explainability method. Extensive perturbation
tests on visual question answering and image captioning validate that our
explainability method outperforms existing methods.
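The abstract leaves the exact accumulation rule unspecified; the following is a minimal sketch of the general idea, assuming a rollout-style update in which each token's contribution is scaled by a weight derived from its value-vector norm. The weighting choice and normalization here are hypothetical stand-ins, not the paper's exact formulation.

```python
import torch

def accumulate_relevance(attentions, values):
    """Rollout-style relevance accumulation across attention layers.

    attentions: list of per-layer attention maps, each (num_tokens, num_tokens),
                already averaged over heads.
    values:     list of per-layer value tensors, each (num_tokens, dim).
    Returns a (num_tokens, num_tokens) relevance map.
    """
    num_tokens = attentions[0].shape[-1]
    # Every token starts out fully relevant only to itself.
    R = torch.eye(num_tokens)
    for A, V in zip(attentions, values):
        # Equal accumulation (prior work): R = R + A @ R
        # Weighted accumulation (sketch): scale each token's update by the norm
        # of its value vector, so low-importance tokens distort R less.
        w = V.norm(dim=-1)            # (num_tokens,) hypothetical weights
        w = w / (w.sum() + 1e-6)      # hypothetical normalization
        R = R + torch.diag(w) @ (A @ R)
    return R
```

A perturbation test of the kind described in the abstract would then rank input tokens by their row of R (for instance the row of the [CLS] token), mask the highest-ranked tokens, and measure how quickly the model's prediction degrades.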
Related papers
- AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference [51.1972443343829]
We propose AttentionPredictor, which is the first learning-based critical token identification approach.
AttentionPredictor accurately predicts the attention score while consuming negligible memory.
We also propose a cross-token critical cache prefetching framework that hides the token time overhead to accelerate the decoding stage.
arXiv Detail & Related papers (2025-02-06T13:41:46Z)
- Core Context Aware Attention for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling.
Our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
arXiv Detail & Related papers (2024-12-17T01:54:08Z)
- Recycled Attention: Efficient inference for long-context language models [54.00118604124301]
We propose Recycled Attention, an inference-time method which alternates between full context attention and attention over a subset of input tokens.
When performing partial attention, we recycle the attention pattern of a previous token that has performed full attention and attend only to the top K most attended tokens.
Compared to previously proposed inference-time acceleration methods, which attend only to the local context or to tokens with high accumulated attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step.
arXiv Detail & Related papers (2024-11-08T18:57:07Z)
- Elliptical Attention [1.7597562616011944]
Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision.
We propose using a Mahalanobis distance metric for computing the attention weights to stretch the underlying feature space in directions of high contextual relevance.
arXiv Detail & Related papers (2024-06-19T18:38:11Z)
- Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use [74.72150542395487]
An inherent waveform pattern in the attention allocation of large language models (LLMs) significantly affects their performance in tasks demanding a high degree of context awareness.
To address this issue, we propose a novel inference method named Attention Buckets.
arXiv Detail & Related papers (2023-12-07T17:24:51Z)
- Revisiting The Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis [54.94682858474711]
Class Activation Mapping (CAM) approaches provide an effective visualization by taking weighted averages of the activation maps.
We propose a novel set of metrics to quantify explanation maps, which show better effectiveness and simplify comparisons between approaches.
arXiv Detail & Related papers (2021-04-20T21:34:24Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Quantifying Attention Flow in Transformers [12.197250533100283]
"self-attention" combines information from attended embeddings into the representation of the focal embedding in the next layer.
This makes attention weights unreliable as explanations probes.
We propose two methods for approximating the attention to input tokens given attention weights, attention rollout and attention flow.
arXiv Detail & Related papers (2020-05-02T21:45:27Z)
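For context, the attention rollout baseline from the last entry can be sketched in a few lines: each layer's attention map is mixed with the identity to account for residual connections, renormalized, and multiplied across layers. This is a minimal sketch; head averaging and the mixing coefficient vary between implementations.

```python
import torch

def attention_rollout(attentions, residual_alpha=0.5):
    """Approximate token-to-token information flow by multiplying
    residual-adjusted attention maps across layers.

    attentions: list of (num_tokens, num_tokens) maps, averaged over heads.
    """
    num_tokens = attentions[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for A in attentions:
        # Mix attention with the identity to model the residual connection,
        # then renormalize rows so each remains a distribution over tokens.
        A_hat = residual_alpha * A + (1 - residual_alpha) * torch.eye(num_tokens)
        A_hat = A_hat / A_hat.sum(dim=-1, keepdim=True)
        rollout = A_hat @ rollout
    return rollout
```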