Counterfactual Samples Synthesizing and Training for Robust Visual
Question Answering
- URL: http://arxiv.org/abs/2110.01013v2
- Date: Sun, 25 Jun 2023 02:23:05 GMT
- Title: Counterfactual Samples Synthesizing and Training for Robust Visual
Question Answering
- Authors: Long Chen, Yuhang Zheng, Yulei Niu, Hanwang Zhang, Jun Xiao
- Abstract summary: VQA models still tend to capture superficial linguistic correlations in the training set.
Recent VQA works introduce an auxiliary question-only model to regularize the training of targeted VQA models.
We propose a novel model-agnostic Counterfactual Samples Synthesizing and Training (CSST) strategy.
- Score: 59.20766562530209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Today's VQA models still tend to capture superficial linguistic correlations
in the training set and fail to generalize to the test set with different QA
distributions. To reduce these language biases, recent VQA works introduce an
auxiliary question-only model to regularize the training of the targeted VQA
model, and achieve dominant performance on diagnostic benchmarks for
out-of-distribution testing. However, due to complex model design, these
ensemble-based methods are unable to equip themselves with two indispensable
characteristics of an ideal VQA model: 1) Visual-explainable: The model should
rely on the right visual regions when making decisions. 2) Question-sensitive:
The model should be sensitive to the linguistic variations in questions. To
this end, we propose a novel model-agnostic Counterfactual Samples Synthesizing
and Training (CSST) strategy. After training with CSST, VQA models are forced
to focus on all critical objects and words, which significantly improves both
visual-explainable and question-sensitive abilities. Specifically, CSST is
composed of two parts: Counterfactual Samples Synthesizing (CSS) and
Counterfactual Samples Training (CST). CSS generates counterfactual samples by
carefully masking critical objects in images or words in questions and
assigning pseudo ground-truth answers. CST not only trains the VQA models on
both complementary samples to predict their respective ground-truth answers,
but also urges the models to distinguish the original samples from
superficially similar counterfactual ones. To facilitate CST training, we
propose two variants of supervised contrastive loss for VQA, and design an
effective positive and negative sample selection mechanism based on CSS.
Extensive experiments have shown the effectiveness of CSST. In particular, by
building on top of the LMH+SAR model, we achieve record-breaking performance
on all OOD benchmarks.
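
As a rough illustration of how the two components could fit together, the following PyTorch sketch mimics the visual side of CSS (masking the most important objects and suppressing the original answer as a pseudo label) and one plausible supervised contrastive loss of the kind CST could use. The function names, tensor shapes, top-k threshold, and answer-reassignment rule here are illustrative assumptions, not the authors' released implementation:

# Minimal sketch of counterfactual sample synthesis plus a supervised
# contrastive loss. All shapes and rules below are assumptions for
# illustration only.
import torch
import torch.nn.functional as F

def synthesize_visual_counterfactual(obj_feats, obj_importance, answer_probs, top_k=3):
    """Mask the top-k most important objects and suppress the original
    ground-truth answer as a pseudo label (assumed zeroing rule).
    obj_feats:      (num_objs, dim) region features
    obj_importance: (num_objs,) importance scores, e.g. from an attribution method
    answer_probs:   (num_answers,) original soft ground-truth scores
    """
    critical = torch.topk(obj_importance, k=top_k).indices
    cf_feats = obj_feats.clone()
    cf_feats[critical] = 0.0                    # "remove" the critical regions
    pseudo_answer = answer_probs.clone()
    pseudo_answer[answer_probs > 0] = 0.0       # masked evidence -> answer no longer valid
    return cf_feats, pseudo_answer

def supervised_contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """Pull the anchor toward positives (original/complementary samples) and
    push it away from negatives (counterfactual ones).
    anchor: (dim,)   positives: (P, dim)   negatives: (N, dim)
    """
    anchor = F.normalize(anchor, dim=-1)
    pos_logits = F.normalize(positives, dim=-1) @ anchor / temperature   # (P,)
    neg_logits = F.normalize(negatives, dim=-1) @ anchor / temperature   # (N,)
    all_logits = torch.cat([pos_logits, neg_logits])
    # For each positive: -log( exp(pos) / sum over all candidates )
    return -(pos_logits - torch.logsumexp(all_logits, dim=0)).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    feats, pseudo = synthesize_visual_counterfactual(
        torch.randn(36, 512), torch.rand(36), torch.tensor([0.0, 1.0, 0.0]))
    print(pseudo)   # original answer suppressed
    print(supervised_contrastive_loss(
        torch.randn(512), torch.randn(4, 512), torch.randn(8, 512)))

In the actual CSST setup, the paired original and counterfactual samples produced by CSS would supply the positive and negative sets consumed by a loss of this form.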
Related papers
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address the mismatch between such generic pre-trained models and the IQA task using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning [54.61762276179205]
We propose a novel contrastive learning approach, MMBS, for building robust VQA models by Making the Most of Biased Samples.
Specifically, we construct positive samples for contrastive learning by eliminating the information related to spurious correlation from the original training samples.
We validate our contributions by achieving competitive performance on the OOD dataset VQA-CP v2 while preserving robust performance on the ID dataset VQA v2.
arXiv Detail & Related papers (2022-10-10T11:05:21Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
- Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization [27.437077941786768]
Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks.
We evaluate two pretrained V&L models under different settings by conducting cross-dataset evaluations.
We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task.
arXiv Detail & Related papers (2022-05-24T16:44:45Z)
- Self-Supervised VQA: Answering Visual Questions using Images and Captions [38.05223339919346]
VQA models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets for training.
We study whether models can be trained without any human-annotated Q-A pairs, but only with images and associated text captions.
arXiv Detail & Related papers (2020-12-04T01:22:05Z)
- MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering [58.30291671877342]
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
- Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models [39.338304913058685]
We study the trade-off between the model complexity and the performance on the Visual Question Answering task.
We focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline.
arXiv Detail & Related papers (2020-01-20T11:27:21Z)