SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context
in Visual Question Answering
- URL: http://arxiv.org/abs/2204.02285v1
- Date: Tue, 5 Apr 2022 15:32:25 GMT
- Title: SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context
in Visual Question Answering
- Authors: Vipul Gupta, Zhuowan Li, Adam Kortylewski, Chenyu Zhang, Yingwei Li,
Alan Yuille
- Abstract summary: We study the robustness of Visual Question Answering (VQA) models from a novel perspective: visual context.
SwapMix perturbs the visual context by swapping features of irrelevant context objects with features from other objects in the dataset.
We train models with perfect sight and find that over-reliance on context depends heavily on the quality of the visual representations.
- Score: 20.35687327831644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Visual Question Answering (VQA) has progressed rapidly, previous works
have raised concerns about the robustness of current VQA models. In this work, we study
the robustness of VQA models from a novel perspective: visual context. We suggest that
models over-rely on the visual context, i.e., irrelevant objects in the image, to make
predictions. To diagnose a model's reliance on visual context and measure its robustness,
we propose a simple yet effective perturbation technique, SwapMix. SwapMix perturbs the
visual context by swapping features of irrelevant context objects with features from
other objects in the dataset. Using SwapMix we are able to change the answers to more
than 45% of the questions for a representative VQA model. Additionally, we train the
models with perfect sight and find that over-reliance on context depends heavily on the
quality of the visual representations. Beyond diagnosis, SwapMix can also be applied as a
data augmentation strategy during training to regularize over-reliance on context. By
swapping context object features during training, the model's reliance on context is
suppressed effectively. We study two representative VQA models with SwapMix: a
co-attention model, MCAN, and a large-scale pretrained model, LXMERT. Our experiments on
the popular GQA dataset show the effectiveness of SwapMix for both diagnosing model
robustness and regularizing the over-reliance on visual context. The code for our method
is available at https://github.com/vipulgupta1011/swapmix
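To make the perturbation concrete, below is a minimal Python/PyTorch sketch of a SwapMix-style swap, assuming the VQA model consumes a matrix of per-object region features and that the indices of question-relevant objects are known. The names (swapmix_perturb, feature_bank, relevant_idx, swap_prob) are illustrative assumptions, not the interface of the authors' repository linked above.

```python
import random
import torch

def swapmix_perturb(obj_feats, relevant_idx, feature_bank, swap_prob=1.0, seed=None):
    """Swap features of question-irrelevant context objects (illustrative sketch).

    obj_feats:    (num_objects, feat_dim) tensor of per-object visual features
                  for one image, as fed to a VQA model.
    relevant_idx: indices of objects relevant to the question; left untouched.
    feature_bank: (bank_size, feat_dim) tensor of object features collected
                  from other images in the dataset, used as swap candidates.
    swap_prob:    probability of swapping each irrelevant context object.
    """
    rng = random.Random(seed)
    perturbed = obj_feats.clone()
    relevant = set(relevant_idx)
    for i in range(obj_feats.size(0)):
        if i in relevant:
            continue  # keep question-relevant objects intact
        if rng.random() < swap_prob:
            # replace this context object's features with those of a random
            # object drawn from another image in the dataset
            j = rng.randrange(feature_bank.size(0))
            perturbed[i] = feature_bank[j]
    return perturbed
```

In diagnosis mode, a model is run on the original and the perturbed features and counts as context-reliant on a question if its answer changes; the abstract reports such changes for more than 45% of questions for a representative model. For regularization, the same swap can be applied on the fly to training batches so the model learns to ignore irrelevant context objects.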
Related papers
- VQAttack: Transferable Adversarial Attacks on Visual Question Answering
via Pre-trained Models [58.21452697997078]
We propose a novel VQAttack model, which can generate both image and text perturbations with its designed modules.
Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack.
arXiv Detail & Related papers (2024-02-16T21:17:42Z)
- Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models [6.063024872936599]
We study whether visual cropping can improve the performance of state-of-the-art visual question answering models on fine-detail questions.
We devise two automatic cropping strategies, one based on CLIP multi-modal embeddings and one based on gradients of the BLIP visual QA model.
We gain an improvement of 4.59% (absolute) in the general VQA-random task by simply inputting a concatenation of the original and gradient-based cropped images.
arXiv Detail & Related papers (2023-05-31T22:48:27Z)
- SC-ML: Self-supervised Counterfactual Metric Learning for Debiased Visual Question Answering [10.749155815447127]
We propose a self-supervised counterfactual metric learning (SC-ML) method to better focus on question-relevant image features.
SC-ML can adaptively select the question-relevant visual features to answer the question, reducing the negative influence of question-irrelevant visual features on inferring answers.
arXiv Detail & Related papers (2023-04-04T09:05:11Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
- All You May Need for VQA are Image Captions [24.634567673906666]
We propose a method that automatically derives VQA examples at volume.
We show that the resulting data is of high-quality.
VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits.
arXiv Detail & Related papers (2022-05-04T04:09:23Z)
- Overcoming Language Priors with Self-supervised Learning for Visual Question Answering [62.88124382512111]
Most Visual Question Answering (VQA) models suffer from the language prior problem.
We introduce a self-supervised learning framework to solve this problem.
Our method can significantly outperform the state-of-the-art.
arXiv Detail & Related papers (2020-12-17T12:30:12Z)
- Self-Supervised VQA: Answering Visual Questions using Images and Captions [38.05223339919346]
VQA models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets for training.
We study whether models can be trained without any human-annotated Q-A pairs, but only with images and associated text captions.
arXiv Detail & Related papers (2020-12-04T01:22:05Z)
- What do we expect from Multiple-choice QA Systems? [70.86513724662302]
We consider a top performing model on several Multiple Choice Question Answering (MCQA) datasets.
We evaluate it against a set of expectations one might have from such a model, using a series of zero-information perturbations of the model's inputs.
arXiv Detail & Related papers (2020-11-20T21:27:10Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.