Generic Attention-model Explainability by Weighted Relevance
Accumulation
- URL: http://arxiv.org/abs/2308.10240v1
- Date: Sun, 20 Aug 2023 12:02:30 GMT
- Title: Generic Attention-model Explainability by Weighted Relevance Accumulation
- Authors: Yiming Huang, Aozhe Jia, Xiaodan Zhang, Jiawei Zhang
- Abstract summary: We propose a weighted relevancy strategy, which takes the importance of token values into consideration, to reduce the distortion introduced when relevance is accumulated equally.
To evaluate our method, we propose a unified CLIP-based two-stage model, named CLIPmapper, to process Vision-and-Language tasks.
- Score: 9.816810016935541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based transformer models have achieved remarkable progress in
multi-modal tasks, such as visual question answering. The explainability of
attention-based methods has recently attracted wide interest, since such methods can
explain the internal changes of attention tokens by accumulating relevance across
attention layers. Current methods simply update relevance by equally accumulating
the token relevance before and after each attention process. However, the importance
of token values usually differs during relevance accumulation. In this paper, we
propose a weighted relevancy strategy, which takes the importance of token values
into consideration, to reduce the distortion introduced when relevance is
accumulated equally. To evaluate our method, we propose a unified CLIP-based
two-stage model, named CLIPmapper, which processes Vision-and-Language tasks
through a CLIP encoder followed by a mapper. CLIPmapper combines self-attention,
cross-attention, single-modality, and cross-modality attention, making it well
suited for evaluating our generic explainability method. Extensive perturbation
tests on visual question answering and image captioning validate that our
explainability method outperforms existing methods.
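The abstract leaves the exact accumulation rule unspecified; the following is a minimal sketch of the general idea, assuming a rollout-style update in which each token's contribution is scaled by a weight derived from its value-vector norm. The weighting choice and normalization here are hypothetical stand-ins, not the paper's exact formulation.

```python
import torch

def accumulate_relevance(attentions, values):
    """Rollout-style relevance accumulation across attention layers.

    attentions: list of per-layer attention maps, each (num_tokens, num_tokens),
                already averaged over heads.
    values:     list of per-layer value tensors, each (num_tokens, dim).
    Returns a (num_tokens, num_tokens) relevance map.
    """
    num_tokens = attentions[0].shape[-1]
    # Every token starts out fully relevant only to itself.
    R = torch.eye(num_tokens)
    for A, V in zip(attentions, values):
        # Equal accumulation (prior work): R = R + A @ R
        # Weighted accumulation (sketch): scale each token's update by the norm
        # of its value vector, so low-importance tokens distort R less.
        w = V.norm(dim=-1)            # (num_tokens,) hypothetical weights
        w = w / (w.sum() + 1e-6)      # hypothetical normalization
        R = R + torch.diag(w) @ (A @ R)
    return R
```

A perturbation test of the kind described in the abstract would then rank input tokens by their row of R (for instance the row of the [CLS] token), mask the highest-ranked tokens, and measure how quickly the model's prediction degrades.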
Related papers
- AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference [51.1972443343829]
We propose AttentionPredictor, which is the first learning-based critical token identification approach.
AttentionPredictor accurately predicts the attention score while consuming negligible memory.
We also propose a cross-token critical cache prefetching framework that hides the token time overhead to accelerate the decoding stage.
arXiv Detail & Related papers (2025-02-06T13:41:46Z)
- Core Context Aware Attention for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling.
Our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
arXiv Detail & Related papers (2024-12-17T01:54:08Z)
- Recycled Attention: Efficient inference for long-context language models [54.00118604124301]
We propose Recycled Attention, an inference-time method which alternates between full context attention and attention over a subset of input tokens.
When performing partial attention, we recycle the attention pattern of a previous token that has performed full attention and attend only to the top K most attended tokens.
Compared to previously proposed inference-time acceleration methods, which attend only to the local context or to tokens with high accumulated attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step.
arXiv Detail & Related papers (2024-11-08T18:57:07Z)
- Elliptical Attention [1.7597562616011944]
Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision.
We propose using a Mahalanobis distance metric for computing the attention weights to stretch the underlying feature space in directions of high contextual relevance.
arXiv Detail & Related papers (2024-06-19T18:38:11Z)
- Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use [74.72150542395487]
An inherent waveform pattern in the attention allocation of large language models (LLMs) significantly affects their performance in tasks demanding a high degree of context awareness.
To address this issue, we propose a novel inference method named Attention Buckets.
arXiv Detail & Related papers (2023-12-07T17:24:51Z)
- Revisiting The Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis [54.94682858474711]
Class Activation Mapping (CAM) approaches provide an effective visualization by taking weighted averages of the activation maps.
We propose a novel set of metrics to quantify explanation maps, which show better effectiveness and simplify comparisons between approaches.
arXiv Detail & Related papers (2021-04-20T21:34:24Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Quantifying Attention Flow in Transformers [12.197250533100283]
"self-attention" combines information from attended embeddings into the representation of the focal embedding in the next layer.
This makes attention weights unreliable as explanations probes.
We propose two methods for approximating the attention to input tokens given attention weights, attention rollout and attention flow.
arXiv Detail & Related papers (2020-05-02T21:45:27Z)
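For context, the attention rollout baseline from the last entry can be sketched in a few lines: each layer's attention map is mixed with the identity to account for residual connections, renormalized, and multiplied across layers. This is a minimal sketch; head averaging and the mixing coefficient vary between implementations.

```python
import torch

def attention_rollout(attentions, residual_alpha=0.5):
    """Approximate token-to-token information flow by multiplying
    residual-adjusted attention maps across layers.

    attentions: list of (num_tokens, num_tokens) maps, averaged over heads.
    """
    num_tokens = attentions[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for A in attentions:
        # Mix attention with the identity to model the residual connection,
        # then renormalize rows so each remains a distribution over tokens.
        A_hat = residual_alpha * A + (1 - residual_alpha) * torch.eye(num_tokens)
        A_hat = A_hat / A_hat.sum(dim=-1, keepdim=True)
        rollout = A_hat @ rollout
    return rollout
```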