Related papers: Causal Attention for Unbiased Visual Recognition

URL: http://arxiv.org/abs/2108.08782v1
Date: Thu, 19 Aug 2021 16:45:51 GMT
Title: Causal Attention for Unbiased Visual Recognition
Authors: Tan Wang, Chang Zhou, Qianru Sun, Hanwang Zhang
Abstract summary: Attention modules do not always help deep models learn causal features that are robust in any confounding context. We propose a causal attention module (CaaM) that self-annotates the confounders in an unsupervised fashion. In OOD settings, deep models with CaaM significantly outperform those without it.
Score: 76.87114090435618
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Attention modules do not always help deep models learn causal features that are robust in any confounding context, e.g., a foreground object feature that is invariant to different backgrounds. This is because confounders trick the attention into capturing spurious correlations that benefit the prediction when the training and testing data are IID (independent and identically distributed) but harm it when the data are OOD (out-of-distribution). The sole fundamental solution for learning causal attention is causal intervention, which requires additional annotations of the confounders, e.g., a "dog" model is learned within "grass+dog" and "road+dog" respectively, so the "grass" and "road" contexts no longer confound the "dog" recognition. However, such annotation is not only prohibitively expensive but also inherently problematic, as the confounders are elusive in nature. In this paper, we propose a causal attention module (CaaM) that self-annotates the confounders in an unsupervised fashion. In particular, multiple CaaMs can be stacked and integrated into conventional attention CNNs and self-attention Vision Transformers. In OOD settings, deep models with CaaM significantly outperform those without it; even in IID settings, attention localization is also improved by CaaM, showing great potential in applications that require robust visual saliency. Code is available at https://github.com/Wangt-CN/CaaM.
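Illustrative note: the "grass+dog" / "road+dog" example in the abstract can be read as stratifying by the confounding context. One standard way to formalize this is the backdoor adjustment, P(y | do(x)) = sum over s of P(y | x, s) P(s). The abstract does not state this formula, so treating it as the intended intervention is an assumption. The sketch below is not CaaM: it assumes the contexts are already annotated (the very thing CaaM avoids), and all counts, names, and helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical toy counts over (context s, attended feature x, label y).
# "grass" co-occurs with dogs far more often than "road" does.
counts = Counter({
    ("grass", "dog_like", "dog"): 40,
    ("grass", "dog_like", "not_dog"): 10,
    ("grass", "other", "dog"): 5,
    ("grass", "other", "not_dog"): 5,
    ("road", "dog_like", "dog"): 4,
    ("road", "dog_like", "not_dog"): 6,
    ("road", "other", "dog"): 3,
    ("road", "other", "not_dog"): 27,
})


def p_context(s):
    """Marginal probability P(s) of a context."""
    total = sum(counts.values())
    return sum(n for (si, _, _), n in counts.items() if si == s) / total


def p_label_given(y, x, s=None):
    """P(y | x) if s is None, otherwise P(y | x, s)."""
    def matches(si, xi):
        return xi == x and (s is None or si == s)

    denom = sum(n for (si, xi, _), n in counts.items() if matches(si, xi))
    numer = sum(
        n for (si, xi, yi), n in counts.items() if matches(si, xi) and yi == y
    )
    return numer / denom


def p_label_do(y, x):
    """Backdoor adjustment: sum_s P(y | x, s) * P(s)."""
    contexts = {s for (s, _, _) in counts}
    return sum(p_label_given(y, x, s) * p_context(s) for s in contexts)


observational = p_label_given("dog", "dog_like")  # mixes in the context bias
interventional = p_label_do("dog", "dog_like")    # contexts weighted by P(s)
print(f"P(dog | x)     = {observational:.3f}")
print(f"P(dog | do(x)) = {interventional:.3f}")
```

In this toy table the observational estimate is higher than the adjusted one because "grass" inflates the correlation between the feature and the label; weighting each context by its marginal probability removes that effect.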

Related papers

Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [52.478956204238315]
We study the spatial reasoning challenge through the lens of mechanistic interpretability. We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations. Motivated by these findings, we propose ADAPTVIS to sharpen the attention on highly relevant regions when confident.
arXiv Detail & Related papers (2025-03-03T17:57:03Z)
Seeing Through VisualBERT: A Causal Adventure on Memetic Landscapes [35.36331164446824]
We propose a framework based on a Structural Causal Model (SCM). In this framework, VisualBERT is trained to predict the class of an input meme based on both meme input and causal concepts. We find that input attribution methods do not guarantee causality within our framework, raising questions about their reliability in safety-critical applications.
arXiv Detail & Related papers (2024-10-17T12:32:00Z)
When Attention Sink Emerges in Language Models: An Empirical View [39.36282162213973]
Language Models (LMs) assign significant attention to the first token, even if it is not semantically important. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. We first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models.
arXiv Detail & Related papers (2024-10-14T17:50:28Z)
Guiding Visual Question Answering with Attention Priors [76.21671164766073]
We propose to guide the attention mechanism using explicit linguistic-visual grounding. This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects. The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z)
Learning Target-aware Representation for Visual Tracking via Informative Interactions [49.552877881662475]
We introduce a novel backbone architecture to improve the target-perception ability of the feature representation for tracking. The proposed GIM module and InBN mechanism are general and applicable to different backbone types, including CNN and Transformer.
arXiv Detail & Related papers (2022-01-07T16:22:27Z)
Vision Transformer with Deformable Attention [29.935891419574602]
A large, sometimes even global, receptive field endows Transformer models with higher representation power than their CNN counterparts. We propose a novel deformable self-attention module, where the positions of key and value pairs in self-attention are selected in a data-dependent way. We present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks.
arXiv Detail & Related papers (2022-01-03T08:29:01Z)
Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query. Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions. We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z)
Causal Attention for Vision-Language Tasks [142.82608295995652]
We present a novel attention mechanism: Causal Attention (CATT). CATT removes the ever-elusive confounding effect in existing attention-based vision-language models. In particular, we show that CATT has great potential in large-scale pre-training.
arXiv Detail & Related papers (2021-03-05T06:38:25Z)
SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity. Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism. We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.