Causal Attention for Vision-Language Tasks
- URL: http://arxiv.org/abs/2103.03493v1
- Date: Fri, 5 Mar 2021 06:38:25 GMT
- Title: Causal Attention for Vision-Language Tasks
- Authors: Xu Yang, Hanwang Zhang, Guojun Qi, Jianfei Cai
- Abstract summary: We present a novel attention mechanism: Causal Attention (CATT)
CATT removes the ever-elusive confounding effect in existing attention-based vision-language models.
In particular, we show that CATT has great potential in large-scale pre-training.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We present a novel attention mechanism: Causal Attention (CATT), to remove
the ever-elusive confounding effect in existing attention-based vision-language
models. This effect causes harmful bias that misleads the attention module to
focus on the spurious correlations in training data, damaging the model
generalization. As the confounder is unobserved in general, we use the
front-door adjustment to realize the causal intervention, which does not
require any knowledge of the confounder. Specifically, CATT is implemented as a
combination of 1) In-Sample Attention (IS-ATT) and 2) Cross-Sample Attention
(CS-ATT), where the latter forcibly brings other samples into every IS-ATT,
mimicking the causal intervention. CATT abides by the Q-K-V convention and
hence can replace any attention module such as top-down attention and
self-attention in Transformers. CATT improves various popular attention-based
vision-language models by considerable margins. In particular, we show that
CATT has great potential in large-scale pre-training, e.g., it can lift the
lighter LXMERT~\cite{tan2019lxmert}, which uses less data and
computational power, to performance comparable to the heavier UNITER~\cite{chen2020uniter}.
Code is published in \url{https://github.com/yangxuntu/catt}.
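Based only on the abstract, the combination of In-Sample Attention and Cross-Sample Attention can be sketched as follows. This is a minimal illustration, not the paper's implementation: the use of plain scaled dot-product attention, the "dictionary of other samples" construction, and the equal-weight averaging of the two branches are all assumptions made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention over the Q-K-V convention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def causal_attention(q, k_in, v_in, k_cross, v_cross):
    """IS-ATT attends to keys/values from the current sample; CS-ATT
    attends to keys/values drawn from other samples (e.g. a global
    feature dictionary). Combining them by simple averaging is an
    assumption for illustration."""
    is_att = attention(q, k_in, v_in)        # In-Sample Attention
    cs_att = attention(q, k_cross, v_cross)  # Cross-Sample Attention
    return 0.5 * (is_att + cs_att)

rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal((4, d))             # 4 query vectors
k_in = v_in = rng.standard_normal((6, d))   # current sample's features
k_x = v_x = rng.standard_normal((16, d))    # features from other samples
out = causal_attention(q, k_in, v_in, k_x, v_x)
print(out.shape)  # (4, 8)
```

Because the sketch keeps the Q-K-V interface, it could in principle replace a standard attention call in a top-down attention module or a Transformer self-attention layer, which is the drop-in property the abstract claims for CATT.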
Related papers
- Towards Robust Semantic Segmentation against Patch-based Attack via Attention Refinement
We observe that the attention mechanism is vulnerable to patch-based adversarial attacks.
In this paper, we propose a Robust Attention Mechanism (RAM) to improve the robustness of the semantic segmentation model.
arXiv Detail & Related papers (2024-01-03T13:58:35Z)
- Context De-confounded Emotion Recognition
Context-Aware Emotion Recognition (CAER) aims to perceive the emotional states of the target person with contextual information.
A long-overlooked issue is that a context bias in existing datasets leads to a significantly unbalanced distribution of emotional states.
This paper provides a causality-based perspective to disentangle the models from the impact of such bias, and formulate the causalities among variables in the CAER task.
arXiv Detail & Related papers (2023-03-21T15:12:20Z)
- Guiding Visual Question Answering with Attention Priors
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z)
- A Context-Aware Feature Fusion Framework for Punctuation Restoration
We propose a novel Feature Fusion framework based on two-type Attentions (FFA) to alleviate the shortage of attention.
Experiments on the popular benchmark dataset IWSLT demonstrate that our approach is effective.
arXiv Detail & Related papers (2022-03-23T15:29:28Z)
- Boosting Crowd Counting via Multifaceted Attention
Large-scale variations often exist within crowd images.
Neither the fixed-size convolution kernels of CNNs nor the fixed-size attention of recent vision transformers can handle this kind of variation.
We propose a Multifaceted Attention Network (MAN) to improve transformer models in local spatial relation encoding.
arXiv Detail & Related papers (2022-03-05T01:36:43Z)
- Impact of Attention on Adversarial Robustness of Image Classification Models
Adversarial attacks against deep learning models have gained significant attention.
Recent works have proposed explanations for the existence of adversarial examples and techniques to defend the models against these attacks.
This work aims at a general understanding of the impact of attention on adversarial robustness.
arXiv Detail & Related papers (2021-09-02T13:26:32Z)
- Causal Attention for Unbiased Visual Recognition
The attention module does not always help deep models learn causal features that are robust in any confounding context.
We propose a causal attention module (CaaM) that self-annotates the confounders in an unsupervised fashion.
In OOD settings, deep models with CaaM outperform those without it significantly.
arXiv Detail & Related papers (2021-08-19T16:45:51Z)
- More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints
We propose Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints to address this limitation.
CCR and CCS constraints supervise the training of attention models in a contrastive learning manner without requiring explicit attention annotations.
Experiments on both Flickr30k and MS-COCO datasets demonstrate that integrating these attention constraints into two state-of-the-art attention-based models improves the model performance.
arXiv Detail & Related papers (2021-05-20T08:48:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.