Contrast and Classify: Training Robust VQA Models
- URL: http://arxiv.org/abs/2010.06087v2
- Date: Mon, 19 Apr 2021 03:45:27 GMT
- Title: Contrast and Classify: Training Robust VQA Models
- Authors: Yash Kant, Abhinav Moudgil, Dhruv Batra, Devi Parikh, Harsh Agrawal
- Abstract summary: We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
- Score: 60.80627814762071
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent Visual Question Answering (VQA) models have shown impressive
performance on the VQA benchmark but remain sensitive to small linguistic
variations in input questions. Existing approaches address this by augmenting
the dataset with question paraphrases from visual question generation models or
adversarial perturbations. These approaches use the combined data to learn an
answer classifier by minimizing the standard cross-entropy loss. To more
effectively leverage augmented data, we build on the recent success in
contrastive learning. We propose a novel training paradigm (ConClaT) that
optimizes both cross-entropy and contrastive losses. The contrastive loss
encourages representations to be robust to linguistic variations in questions
while the cross-entropy loss preserves the discriminative power of
representations for answer prediction.
We find that optimizing both losses -- either alternately or jointly -- is
key to effective training. On the VQA-Rephrasings benchmark, which measures the
VQA model's answer consistency across human paraphrases of a question, ConClaT
improves Consensus Score by 1.63% over an improved baseline. In addition, on
the standard VQA 2.0 benchmark, we improve the VQA accuracy by 0.78% overall.
We also show that ConClaT is agnostic to the type of data-augmentation strategy
used.
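The abstract's core recipe, minimizing a cross-entropy answer loss together with a contrastive loss over question paraphrases, can be illustrated with a short PyTorch-style sketch. The `model` interface (returning answer logits and a pooled multimodal representation), the batch keys, the InfoNCE-style contrastive term, the temperature, and the weight `lam` are assumptions made for illustration, not the paper's reference implementation.

```python
# Minimal sketch of ConClaT-style joint training (illustrative, not the
# authors' implementation): cross-entropy on answers plus a contrastive
# term that pulls a question and its paraphrase together.
import torch
import torch.nn.functional as F

def paraphrase_contrastive_loss(z_q, z_p, temperature=0.1):
    """Simplified InfoNCE: each question representation should be most
    similar to its own paraphrase within the batch (assumed formulation)."""
    z_q, z_p = F.normalize(z_q, dim=1), F.normalize(z_p, dim=1)
    logits = z_q @ z_p.t() / temperature               # (B, B) similarities
    targets = torch.arange(z_q.size(0), device=z_q.device)
    # Symmetrize: question -> paraphrase and paraphrase -> question.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def joint_training_step(model, batch, optimizer, lam=0.5):
    """One joint step; `model` is a hypothetical VQA encoder returning
    (answer_logits, pooled_representation) for an image-question pair."""
    logits_q, rep_q = model(batch["image"], batch["question"])
    _, rep_p = model(batch["image"], batch["paraphrase"])

    ce = F.cross_entropy(logits_q, batch["answer"])      # discriminative term
    con = paraphrase_contrastive_loss(rep_q, rep_p)      # robustness term
    loss = ce + lam * con                                # joint objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The alternating schedule mentioned in the abstract would instead step on the cross-entropy and contrastive terms in separate iterations rather than summing them; either way, the contrastive term shapes the representation to be stable across paraphrases while the cross-entropy term keeps it discriminative for answer classification.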
Related papers
- Exploring Question Decomposition for Zero-Shot VQA [99.32466439254821]
We investigate a question decomposition strategy for visual question answering.
We show that naive application of model-written decompositions can hurt performance.
We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors.
arXiv Detail & Related papers (2023-10-25T23:23:57Z)
- Logical Implications for Visual Question Answering Consistency [2.005299372367689]
We introduce a new consistency loss term that can be used by a wide range of VQA models.
We propose to infer these logical relations using a dedicated language model and use these in our proposed consistency loss function.
We conduct extensive experiments on the VQA Introspect and DME datasets and show that our method brings improvements to state-of-the-art VQA models.
arXiv Detail & Related papers (2023-03-16T16:00:18Z)
- Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out 2 drawbacks in current RVQA research, where (1) datasets contain too many unchallenging UQs and (2) a large number of annotated UQs are required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
This combines pseudo UQs obtained by randomly pairing images and questions, with an ...
arXiv Detail & Related papers (2023-03-09T06:58:29Z)
- Learning from Lexical Perturbations for Consistent Visual Question Answering [78.21912474223926]
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations.
We propose a novel approach to address this issue based on modular networks, which creates two questions related by linguistic perturbations.
We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
arXiv Detail & Related papers (2020-11-26T17:38:03Z)
- Counterfactual Variable Control for Robust and Interpretable Question Answering [57.25261576239862]
Deep neural network based question answering (QA) models are neither robust nor explainable in many cases.
In this paper, we inspect such spurious "capability" of QA models using causal inference.
We propose a novel approach called Counterfactual Variable Control (CVC) that explicitly mitigates any shortcut correlation.
arXiv Detail & Related papers (2020-10-12T10:09:05Z)
- IQ-VQA: Intelligent Visual Question Answering [3.09911862091928]
We show that our framework improves consistency of VQA models by 15% on the rule-based dataset.
We also quantitatively show improvement in attention maps which highlights better multi-modal understanding of vision and language.
arXiv Detail & Related papers (2020-07-08T20:41:52Z)
- Estimating semantic structure for the VQA answer space [6.49970685896541]
We show that our approach is completely model-agnostic, as it yields consistent improvements with three different VQA models.
We report SOTA-level performance on the challenging VQAv2-CP dataset.
arXiv Detail & Related papers (2020-06-10T08:32:56Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)