Causal Attention for Unbiased Visual Recognition
- URL: http://arxiv.org/abs/2108.08782v1
- Date: Thu, 19 Aug 2021 16:45:51 GMT
- Title: Causal Attention for Unbiased Visual Recognition
- Authors: Tan Wang, Chang Zhou, Qianru Sun, Hanwang Zhang
- Abstract summary: Attention modules do not always help deep models learn causal features that are robust in any confounding context.
We propose a causal attention module (CaaM) that self-annotates the confounders in an unsupervised fashion.
In OOD settings, deep models with CaaM significantly outperform those without it.
- Score: 76.87114090435618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention modules do not always help deep models learn causal features that
are robust in any confounding context, e.g., a foreground object feature that is
invariant to different backgrounds. This is because the confounders trick the
attention into capturing spurious correlations that benefit the prediction when the
training and testing data are IID (independent and identically distributed) but
harm the prediction when the data are OOD (out-of-distribution). The only
fundamental way to learn causal attention is causal intervention, which
requires additional annotations of the confounders, e.g., a "dog" model is
learned within "grass+dog" and "road+dog" respectively, so the "grass" and
"road" contexts will no longer confound the "dog" recognition. However, such
annotation is not only prohibitively expensive, but also inherently
problematic, as the confounders are elusive in nature. In this paper, we
propose a causal attention module (CaaM) that self-annotates the confounders in
an unsupervised fashion. In particular, multiple CaaMs can be stacked and
integrated into conventional attention CNNs and self-attention Vision Transformers.
In OOD settings, deep models with CaaM significantly outperform those without
it; even in IID settings, attention localization is
improved by CaaM, showing great potential in applications that require robust
visual saliency. Code is available at https://github.com/Wangt-CN/CaaM.
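The "grass+dog" / "road+dog" example in the abstract corresponds to the standard backdoor adjustment, P(Y | do(X)) = sum over s of P(Y | X, s) P(s), where s is the context. The following is a minimal sketch of that adjustment on toy data, not the CaaM method itself: it assumes the context labels are already given, whereas CaaM is described as discovering them without supervision. The counts, the binary "feature fires" variable, and the function names are all hypothetical and chosen only for illustration.

```python
# Hypothetical toy counts: (context, feature_fires, is_dog) -> number of images.
# Within each context the label is independent of the feature, but "grass"
# makes both the feature and the "dog" label more likely (a confounder).
COUNTS = {
    ("grass", 1, 1): 64, ("grass", 1, 0): 16,
    ("grass", 0, 1): 16, ("grass", 0, 0): 4,
    ("road", 1, 1): 4, ("road", 1, 0): 16,
    ("road", 0, 1): 16, ("road", 0, 0): 64,
}


def naive_conditional(counts, x):
    """P(Y=1 | X=x), pooling all contexts together."""
    positives = sum(n for (_, xi, y), n in counts.items() if xi == x and y == 1)
    total = sum(n for (_, xi, _), n in counts.items() if xi == x)
    return positives / total


def backdoor_adjusted(counts, x):
    """P(Y=1 | do(X=x)) = sum_s P(Y=1 | X=x, S=s) * P(S=s)."""
    total = sum(counts.values())
    contexts = sorted({s for (s, _, _) in counts})
    result = 0.0
    for s in contexts:
        # Number of images in this context, and of those, how many have X=x.
        n_s = sum(n for (si, _, _), n in counts.items() if si == s)
        n_sx = sum(n for (si, xi, _), n in counts.items() if si == s and xi == x)
        n_sx_pos = counts.get((s, x, 1), 0)
        # Per-context prediction, weighted by how common the context is.
        result += (n_sx_pos / n_sx) * (n_s / total)
    return result


if __name__ == "__main__":
    for x in (1, 0):
        print(x, naive_conditional(COUNTS, x), backdoor_adjusted(COUNTS, x))
```

On this toy data the pooled estimate suggests the feature matters (about 0.68 when it fires versus 0.32 when it does not), while the adjusted estimate is 0.5 in both cases, so the apparent effect comes entirely from the context.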
Related papers
- Unsupervised Keypoints from Pretrained Diffusion Models [31.147785019795347]
We leverage the emergent knowledge within text-to-image diffusion models, towards more robust unsupervised keypoints.
Our core idea is to find text embeddings that would cause the generative model to consistently attend to compact regions in images.
We validate our performance on multiple datasets: the CelebA, CUB-200-2011, Tai-Chi-HD, DeepFashion, and Human3.6m datasets.
arXiv Detail & Related papers (2023-11-29T19:43:38Z)
- Guiding Visual Question Answering with Attention Priors [76.21671164766073]
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z)
- Learning Target-aware Representation for Visual Tracking via Informative Interactions [49.552877881662475]
We introduce a novel backbone architecture to improve the target-perception ability of feature representations for tracking.
The proposed GIM module and InBN mechanism are general and applicable to different backbone types including CNN and Transformer.
arXiv Detail & Related papers (2022-01-07T16:22:27Z)
- Vision Transformer with Deformable Attention [29.935891419574602]
A large, sometimes even global, receptive field endows Transformer models with higher representation power than their CNN counterparts.
We propose a novel deformable self-attention module, where the positions of key and value pairs in self-attention are selected in a data-dependent way.
We present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks.
arXiv Detail & Related papers (2022-01-03T08:29:01Z)
- Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z)
- Causal Attention for Vision-Language Tasks [142.82608295995652]
We present a novel attention mechanism: Causal Attention (CATT).
CATT removes the ever-elusive confounding effect in existing attention-based vision-language models.
In particular, we show that CATT has great potential in large-scale pre-training.
arXiv Detail & Related papers (2021-03-05T06:38:25Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide SparseBERT design.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.