Overcoming Language Priors in Visual Question Answering via
Distinguishing Superficially Similar Instances
- URL: http://arxiv.org/abs/2209.08529v1
- Date: Sun, 18 Sep 2022 10:30:44 GMT
- Title: Overcoming Language Priors in Visual Question Answering via
Distinguishing Superficially Similar Instances
- Authors: Yike Wu, Yu Zhao, Shiwan Zhao, Ying Zhang, Xiaojie Yuan, Guoqing Zhao,
Ning Jiang
- Abstract summary: We propose a novel training framework that explicitly encourages the VQA model to distinguish between the superficially similar instances.
We exploit the proposed distinguishing module to increase the distance between the instance and its counterparts in the answer space.
Experimental results show that our method achieves state-of-the-art performance on VQA-CP v2.
- Score: 17.637150597493463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the great progress of Visual Question Answering (VQA), current VQA
models heavily rely on the superficial correlation between the question type
and its corresponding frequent answers (i.e., language priors) to make
predictions, without really understanding the input. In this work, we define
the training instances with the same question type but different answers as
superficially similar instances, and attribute the language priors to
the confusion of the VQA model on such instances. To solve this problem, we propose
a novel training framework that explicitly encourages the VQA model to
distinguish between the superficially similar instances. Specifically, for each
training instance, we first construct a set that contains its superficially
similar counterparts. Then we exploit the proposed distinguishing module to
increase the distance between the instance and its counterparts in the answer
space. In this way, the VQA model is forced to further focus on the other parts
of the input beyond the question type, which helps to overcome the language
priors. Experimental results show that our method achieves state-of-the-art
performance on VQA-CP v2. Code is available at
https://github.com/wyk-nku/Distinguishing-VQA.git.
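As a rough illustration of the framework described above, the sketch below builds a counterpart set per training instance and adds a loss term that pushes each instance away from its counterparts in the answer space. This is a hedged sketch, not the authors' released code: the dataset layout (question_type and answer fields), the cosine-similarity-with-margin distance, the answer-vocabulary size, and the weighting of the auxiliary term are all assumptions.

```python
import torch
import torch.nn.functional as F

def build_counterpart_sets(dataset):
    """Group training instances by question type; an instance's counterparts
    share its question type but carry a different ground-truth answer,
    following the abstract's definition of superficially similar instances."""
    by_qtype = {}
    for idx, ex in enumerate(dataset):
        by_qtype.setdefault(ex["question_type"], []).append(idx)
    return {
        idx: [j for j in by_qtype[ex["question_type"]]
              if dataset[j]["answer"] != ex["answer"]]
        for idx, ex in enumerate(dataset)
    }

def distinguishing_loss(anchor_logits, counterpart_logits, margin=0.5):
    """Push each instance away from its counterparts in the answer space.
    Cosine similarity with a hinge margin is an illustrative choice of
    distance, not necessarily the one used in the paper.

    anchor_logits:      (B, A)    answer logits of the training instances
    counterpart_logits: (B, K, A) answer logits of K counterparts per instance
    """
    sim = F.cosine_similarity(anchor_logits.unsqueeze(1), counterpart_logits, dim=-1)
    return F.relu(sim - (1.0 - margin)).mean()

# Shape check with random tensors (sizes are hypothetical; 3129 is a common
# VQA answer-vocabulary size).
anchor = torch.randn(32, 3129)
counterparts = torch.randn(32, 4, 3129)
aux_loss = distinguishing_loss(anchor, counterparts)
# The full objective would combine this with the usual VQA loss, e.g.
# total = vqa_loss + lam * aux_loss, where lam is a tuning weight.
```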
Related papers
- Zero-shot Translation of Attention Patterns in VQA Models to Natural Language [65.94419474119162]
ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
Our framework does not require any training and allows the drop-in replacement of different guiding sources.
arXiv Detail & Related papers (2023-11-08T22:18:53Z)
- Overcoming Language Bias in Remote Sensing Visual Question Answering via Adversarial Training [22.473676537463607]
Visual Question Answering (VQA) models commonly face the challenge of language bias.
We present a novel framework to reduce the language bias of the VQA for remote sensing data.
arXiv Detail & Related papers (2023-06-01T09:32:45Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss [73.65872901950135]
This work attempts to tackle the language prior problem from the viewpoint of the feature space learning.
An adapted margin cosine loss is designed to discriminate the frequent and the sparse answer feature space.
Experimental results demonstrate that our adapted margin cosine loss can greatly enhance the baseline models.
arXiv Detail & Related papers (2021-05-05T11:41:38Z)
- Self-Supervised VQA: Answering Visual Questions using Images and Captions [38.05223339919346]
VQA models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets for training.
We study whether models can be trained without any human-annotated Q-A pairs, but only with images and associated text captions.
arXiv Detail & Related papers (2020-12-04T01:22:05Z)
- Learning from Lexical Perturbations for Consistent Visual Question Answering [78.21912474223926]
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations.
To address this issue, we propose a novel approach based on modular networks that creates two questions related by linguistic perturbations.
We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
arXiv Detail & Related papers (2020-11-26T17:38:03Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
- Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder [12.56413718364189]
We propose a novel model-agnostic question encoder, Visually-Grounded Question Encoder (VGQE), for VQA.
VGQE utilizes both visual and language modalities equally while encoding the question.
We demonstrate the effect of VGQE on three recent VQA models and achieve state-of-the-art results.
arXiv Detail & Related papers (2020-07-13T05:36:36Z)
- Estimating semantic structure for the VQA answer space [6.49970685896541]
We show that our approach is completely model-agnostic since it allows consistent improvements with three different VQA models.
We report SOTA-level performance on the challenging VQA-CP v2 dataset.
arXiv Detail & Related papers (2020-06-10T08:32:56Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
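The last entry above (Counterfactual Samples Synthesizing) relies on masking critical image objects or question words to build counterfactual training samples. The sketch below illustrates only that masking step, under assumptions about the instance layout (object_features, a plain-string question) and the mask token; the attribution step that selects the critical items and the paper's relabeling of answers are not reproduced here.

```python
import copy

MASK_TOKEN = "[MASK]"  # placeholder token; the paper's exact choice may differ

def synthesize_counterfactuals(example, critical_obj_ids, critical_word_ids):
    """Build two counterfactual variants of one VQA training instance:
    one with the critical image objects removed, one with the critical
    question words masked. Which objects/words count as critical is assumed
    to come from an attribution step (e.g. gradient-based importance) that
    is not shown here."""
    # Image side: remove the critical object features from the representation.
    v_sample = copy.deepcopy(example)
    v_sample["object_features"] = [
        feat for i, feat in enumerate(example["object_features"])
        if i not in critical_obj_ids
    ]
    # Question side: replace the critical question words with a mask token.
    q_sample = copy.deepcopy(example)
    tokens = example["question"].split()
    q_sample["question"] = " ".join(
        MASK_TOKEN if i in critical_word_ids else tok
        for i, tok in enumerate(tokens)
    )
    # With the decisive evidence removed, the original answer no longer
    # applies, so the counterfactual samples would receive modified
    # supervision (the exact relabeling scheme is described in the paper).
    return v_sample, q_sample
```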
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.