Learning from Lexical Perturbations for Consistent Visual Question
Answering
- URL: http://arxiv.org/abs/2011.13406v2
- Date: Wed, 23 Dec 2020 00:29:27 GMT
- Title: Learning from Lexical Perturbations for Consistent Visual Question
Answering
- Authors: Spencer Whitehead, Hui Wu, Yi Ren Fung, Heng Ji, Rogerio Feris, Kate
Saenko
- Abstract summary: Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations.
We propose a novel approach to address this issue based on modular networks, which creates two questions related by linguistic perturbations and regularizes the visual reasoning between them to be consistent during training.
We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
- Score: 78.21912474223926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Visual Question Answering (VQA) models are often fragile and
sensitive to input variations. In this paper, we propose a novel approach to
address this issue based on modular networks, which creates two questions
related by linguistic perturbations and regularizes the visual reasoning
process between them to be consistent during training. We show that our
framework markedly improves consistency and generalization ability,
demonstrating the value of controlled linguistic perturbations as a useful and
currently underutilized training and regularization tool for VQA models. We
also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and
augmentation pipeline to create controllable linguistic variations of VQA
questions. Our benchmark uniquely draws from large-scale linguistic resources,
avoiding human annotation effort while maintaining data quality compared to
generative approaches. We benchmark existing VQA models using VQA P2 and
provide robustness analysis on each type of linguistic variation.
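The core training idea can be sketched as a consistency-regularized loss over an (original, perturbed) question pair. The snippet below is a minimal illustration under stated assumptions, not the authors' modular-network implementation: `vqa_model` is a hypothetical module mapping an image and a question to answer logits, and the consistency term is a symmetric KL divergence between the two answer distributions rather than the paper's regularizer on the modular reasoning process.
```python
import torch.nn.functional as F

def consistency_regularized_loss(vqa_model, image, question, perturbed_question,
                                 answer_target, consistency_weight=1.0):
    """Answer both versions of the question and penalize disagreement
    between the two predicted answer distributions."""
    logits_orig = vqa_model(image, question)
    logits_pert = vqa_model(image, perturbed_question)

    # Standard VQA cross-entropy on both the original and perturbed question.
    ce = F.cross_entropy(logits_orig, answer_target) + \
         F.cross_entropy(logits_pert, answer_target)

    # Symmetric KL divergence encourages the two passes to agree.
    log_p = F.log_softmax(logits_orig, dim=-1)
    log_q = F.log_softmax(logits_pert, dim=-1)
    consistency = 0.5 * (
        F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
        + F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    )
    return ce + consistency_weight * consistency
```
Since both forward passes share parameters, the consistency term only constrains how the model treats the paraphrase, while the cross-entropy terms keep both predictions anchored to the ground-truth answer.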
Related papers
- The curse of language biases in remote sensing VQA: the role of spatial
attributes, language diversity, and the need for clear evaluation [32.7348470366509]
The goal of RSVQA is to answer a question formulated in natural language about a remote sensing image.
The problem of language biases is often overlooked in the remote sensing community.
The present work aims at highlighting the problem of language biases in RSVQA with a threefold analysis strategy.
arXiv Detail & Related papers (2023-11-28T13:45:15Z)
- Logical Implications for Visual Question Answering Consistency [2.005299372367689]
We introduce a new consistency loss term that can be used by a wide range of VQA models.
We propose to infer logical relations between question-answer pairs using a dedicated language model and use them in our proposed consistency loss function.
We conduct extensive experiments on the VQA Introspect and DME datasets and show that our method brings improvements to state-of-the-art VQA models.
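As an illustration of what such a consistency term might look like, the sketch below penalizes predictions that violate an implication between two question-answer pairs: if the model believes the premise answer, it should be at least as confident in the implied answer. This is a hinge-style stand-in, not the paper's exact loss, and the relation-inferring language model is assumed to have already labeled the pair.
```python
import torch
import torch.nn.functional as F

def implication_consistency_loss(logits_premise, logits_conclusion,
                                 answer_premise, answer_conclusion):
    """Penalize cases where the model is confident in the premise answer
    but not in the answer it logically implies."""
    p_premise = F.softmax(logits_premise, dim=-1).gather(
        -1, answer_premise.unsqueeze(-1)).squeeze(-1)
    p_conclusion = F.softmax(logits_conclusion, dim=-1).gather(
        -1, answer_conclusion.unsqueeze(-1)).squeeze(-1)
    # A violation occurs when the premise is believed but the conclusion is not.
    return torch.relu(p_premise - p_conclusion).mean()
```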
arXiv Detail & Related papers (2023-03-16T16:00:18Z)
- Knowledge-Based Counterfactual Queries for Visual Question Answering [0.0]
We propose a systematic method for explaining the behavior and investigating the robustness of VQA models through counterfactual perturbations.
To this end, we exploit structured knowledge bases to perform deterministic, optimal, and controllable word-level replacements targeting the linguistic modality.
We then evaluate the model's response against such counterfactual inputs.
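A minimal sketch of such a word-level replacement is given below, using WordNet antonyms as the knowledge source; WordNet here is an assumption standing in for whatever structured knowledge base the authors use, and the selection of the "optimal" replacement is not reproduced.
```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def counterfactual_question(question, target_word):
    """Replace `target_word` with a WordNet antonym, if one exists,
    to form a controlled counterfactual query."""
    for synset in wn.synsets(target_word):
        for lemma in synset.lemmas():
            if lemma.antonyms():
                replacement = lemma.antonyms()[0].name().replace("_", " ")
                return question.replace(target_word, replacement)
    return None  # no controlled replacement available

# e.g. counterfactual_question("Is the door open?", "open")
# may yield "Is the door closed?"
```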
arXiv Detail & Related papers (2023-03-05T08:00:30Z)
- Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances [17.637150597493463]
We propose a novel training framework that explicitly encourages the VQA model to distinguish between superficially similar instances.
We exploit the proposed distinguishing module to increase the distance between an instance and its counterparts in the answer space.
Experimental results show that our method achieves the state-of-the-art performance on VQA-CP v2.
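The "increase the distance" step could be realized, for example, as a hinge loss on answer-space representations; the sketch below is only illustrative and does not reproduce the paper's distinguishing module or its definition of a counterpart.
```python
import torch
import torch.nn.functional as F

def distinguishing_loss(answer_emb, counterpart_emb, margin=1.0):
    """Push an instance's answer-space representation away from that of a
    superficially similar counterpart with a different answer."""
    distance = F.pairwise_distance(answer_emb, counterpart_emb)
    return torch.relu(margin - distance).mean()
```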
arXiv Detail & Related papers (2022-09-18T10:30:44Z)
- Delving Deeper into Cross-lingual Visual Question Answering [115.16614806717341]
We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance.
We analyze cross-lingual VQA across different question types of varying complexity for different multilingual multimodal Transformers.
arXiv Detail & Related papers (2022-02-15T18:22:18Z)
- AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss [73.65872901950135]
This work attempts to tackle the language prior problem from the viewpoint of feature space learning.
An adapted margin cosine loss is designed to discriminate between frequent and sparse answers in the feature space.
Experimental results demonstrate that our adapted margin cosine loss can greatly enhance the baseline models.
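The general form of a margin cosine loss with per-answer margins is sketched below (CosFace-style); how AdaVQA sets `class_margins` from answer frequencies under each question type is not reproduced here, so treat those values as an assumption.
```python
import torch.nn.functional as F

def margin_cosine_loss(features, weight, targets, class_margins, scale=30.0):
    """Large-margin cosine loss with a per-answer margin subtracted from the
    ground-truth logit before softmax."""
    # Cosine similarity between L2-normalized features and class weights.
    cos = F.linear(F.normalize(features), F.normalize(weight))
    margin = class_margins[targets]                               # (batch,)
    one_hot = F.one_hot(targets, num_classes=cos.size(1)).float()
    logits = scale * (cos - one_hot * margin.unsqueeze(1))
    return F.cross_entropy(logits, targets)
```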
arXiv Detail & Related papers (2021-05-05T11:41:38Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
This view explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
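One simple way to express this view in code is to re-weight the answer cross-entropy by inverse answer frequency, as in the sketch below; the paper's own re-scaling scheme may differ, so this is only an illustration of the class-imbalance framing.
```python
import torch.nn.functional as F

def rescaled_vqa_loss(logits, targets, answer_counts):
    """Cross-entropy with per-answer weights so that rare answers contribute
    more and frequent answers less; `answer_counts` is a tensor of
    training-set frequencies per answer class."""
    weights = 1.0 / answer_counts.float().clamp(min=1)
    weights = weights / weights.sum() * len(answer_counts)  # mean weight ~= 1
    return F.cross_entropy(logits, targets, weight=weights)
```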
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
- Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
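A joint objective of this kind can be sketched as answer cross-entropy plus an InfoNCE-style contrastive term between representations of an example and its augmented counterpart; ConClaT's actual augmentations, projection head, and alternating-vs-joint schedule are not reproduced here.
```python
import torch
import torch.nn.functional as F

def joint_ce_contrastive_loss(logits, targets, z_anchor, z_augmented,
                              temperature=0.1, alpha=0.5):
    """Cross-entropy on answers plus a contrastive loss whose positives are
    the (anchor, augmentation) pairs on the diagonal."""
    ce = F.cross_entropy(logits, targets)

    z1 = F.normalize(z_anchor, dim=-1)
    z2 = F.normalize(z_augmented, dim=-1)
    sim = z1 @ z2.t() / temperature                  # (batch, batch) similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    contrastive = F.cross_entropy(sim, labels)

    return ce + alpha * contrastive
```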
arXiv Detail & Related papers (2020-10-13T00:23:59Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
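The question-side half of the synthesis can be sketched as masking the words deemed critical for answering; how CSS identifies critical words and objects and assigns supervision to the resulting counterfactual samples is not reproduced in this illustration.
```python
def mask_critical_words(question_tokens, critical_indices, mask_token="[MASK]"):
    """Build a counterfactual question by masking the critical words."""
    critical = set(critical_indices)
    return [mask_token if i in critical else tok
            for i, tok in enumerate(question_tokens)]

# e.g. mask_critical_words(["what", "color", "is", "the", "car"], [1])
# -> ["what", "[MASK]", "is", "the", "car"]
```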
arXiv Detail & Related papers (2020-03-14T08:34:31Z)