Causal Attention for Vision-Language Tasks
- URL: http://arxiv.org/abs/2103.03493v1
- Date: Fri, 5 Mar 2021 06:38:25 GMT
- Title: Causal Attention for Vision-Language Tasks
- Authors: Xu Yang, Hanwang Zhang, Guojun Qi, Jianfei Cai
- Abstract summary: We present a novel attention mechanism: Causal Attention (CATT)
CATT removes the ever-elusive confounding effect in existing attention-based vision-language models.
In particular, we show that CATT has great potential in large-scale pre-training.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We present a novel attention mechanism: Causal Attention (CATT), to remove
the ever-elusive confounding effect in existing attention-based vision-language
models. This effect causes harmful bias that misleads the attention module to
focus on the spurious correlations in training data, damaging the model
generalization. As the confounder is unobserved in general, we use the
front-door adjustment to realize the causal intervention, which does not
require any knowledge of the confounder. Specifically, CATT is implemented as a
combination of 1) In-Sample Attention (IS-ATT) and 2) Cross-Sample Attention
(CS-ATT), where the latter forcibly brings other samples into every IS-ATT,
mimicking the causal intervention. CATT abides by the Q-K-V convention and
hence can replace any attention module such as top-down attention and
self-attention in Transformers. CATT improves various popular attention-based
vision-language models by considerable margins. In particular, we show that
CATT has great potential in large-scale pre-training, e.g., it can lift the
lighter LXMERT~\cite{tan2019lxmert}, which uses less data and
computational power, to performance comparable to the heavier UNITER~\cite{chen2020uniter}.
Code is published in \url{https://github.com/yangxuntu/catt}.
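Based only on the abstract, the combination of In-Sample Attention and Cross-Sample Attention can be sketched as follows. This is a minimal illustration, not the paper's implementation: the use of plain scaled dot-product attention, the "dictionary of other samples" construction, and the equal-weight averaging of the two branches are all assumptions made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention over the Q-K-V convention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def causal_attention(q, k_in, v_in, k_cross, v_cross):
    """IS-ATT attends to keys/values from the current sample; CS-ATT
    attends to keys/values drawn from other samples (e.g. a global
    feature dictionary). Combining them by simple averaging is an
    assumption for illustration."""
    is_att = attention(q, k_in, v_in)        # In-Sample Attention
    cs_att = attention(q, k_cross, v_cross)  # Cross-Sample Attention
    return 0.5 * (is_att + cs_att)

rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal((4, d))             # 4 query vectors
k_in = v_in = rng.standard_normal((6, d))   # current sample's features
k_x = v_x = rng.standard_normal((16, d))    # features from other samples
out = causal_attention(q, k_in, v_in, k_x, v_x)
print(out.shape)  # (4, 8)
```

Because the sketch keeps the Q-K-V interface, it could in principle replace a standard attention call in a top-down attention module or a Transformer self-attention layer, which is the drop-in property the abstract claims for CATT.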
Related papers
- Towards Robust Semantic Segmentation against Patch-based Attack via Attention Refinement
We observe that the attention mechanism is vulnerable to patch-based adversarial attacks.
In this paper, we propose a Robust Attention Mechanism (RAM) to improve the robustness of the semantic segmentation model.
arXiv Detail & Related papers (2024-01-03T13:58:35Z)
- Context De-confounded Emotion Recognition
Context-Aware Emotion Recognition (CAER) aims to perceive the emotional states of the target person with contextual information.
A long-overlooked issue is that a context bias in existing datasets leads to a significantly unbalanced distribution of emotional states.
This paper provides a causality-based perspective to disentangle the models from the impact of such bias, and formulate the causalities among variables in the CAER task.
arXiv Detail & Related papers (2023-03-21T15:12:20Z)
- Guiding Visual Question Answering with Attention Priors
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z)
- A Context-Aware Feature Fusion Framework for Punctuation Restoration
We propose a novel Feature Fusion framework based on two-type Attentions (FFA) to alleviate the shortage of attention.
Experiments on the popular benchmark dataset IWSLT demonstrate that our approach is effective.
arXiv Detail & Related papers (2022-03-23T15:29:28Z)
- Boosting Crowd Counting via Multifaceted Attention
Large-scale variations often exist within crowd images.
Neither the fixed-size convolution kernels of CNNs nor the fixed-size attention of recent vision transformers can handle this kind of variation.
We propose a Multifaceted Attention Network (MAN) to improve transformer models in local spatial relation encoding.
arXiv Detail & Related papers (2022-03-05T01:36:43Z)
- Impact of Attention on Adversarial Robustness of Image Classification Models
Adversarial attacks against deep learning models have gained significant attention.
Recent works have proposed explanations for the existence of adversarial examples and techniques to defend the models against these attacks.
This work aims at a general understanding of the impact of attention on adversarial robustness.
arXiv Detail & Related papers (2021-09-02T13:26:32Z)
- Causal Attention for Unbiased Visual Recognition
The attention module does not always help deep models learn causal features that are robust in any confounding context.
We propose a causal attention module (CaaM) that self-annotates the confounders in an unsupervised fashion.
In OOD settings, deep models with CaaM outperform those without it significantly.
arXiv Detail & Related papers (2021-08-19T16:45:51Z)
- More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints
We propose Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints to address this limitation.
CCR and CCS constraints supervise the training of attention models in a contrastive learning manner without requiring explicit attention annotations.
Experiments on both Flickr30k and MS-COCO datasets demonstrate that integrating these attention constraints into two state-of-the-art attention-based models improves the model performance.
arXiv Detail & Related papers (2021-05-20T08:48:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.