Overcoming Language Priors with Self-supervised Learning for Visual
Question Answering
- URL: http://arxiv.org/abs/2012.11528v1
- Date: Thu, 17 Dec 2020 12:30:12 GMT
- Title: Overcoming Language Priors with Self-supervised Learning for Visual
Question Answering
- Authors: Xi Zhu, Zhendong Mao, Chunxiao Liu, Peng Zhang, Bin Wang, and Yongdong
Zhang
- Abstract summary: Most Visual Question Answering (VQA) models suffer from the language prior problem.
We introduce a self-supervised learning framework to solve this problem.
Our method can significantly outperform the state-of-the-art.
- Score: 62.88124382512111
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most Visual Question Answering (VQA) models suffer from the language prior
problem, which is caused by inherent data biases. Specifically, VQA models tend
to answer questions (e.g., what color is the banana?) based on the
high-frequency answers (e.g., yellow) ignoring image contents. Existing
approaches tackle this problem by creating delicate models or introducing
additional visual annotations to reduce question dependency while strengthening
image dependency. However, they are still subject to the language prior problem
since the underlying data biases themselves have not been alleviated. In this paper, we
introduce a self-supervised learning framework to solve this problem.
Concretely, we first automatically generate labeled data to balance the biased
data, and propose a self-supervised auxiliary task to utilize the balanced data
to assist the base VQA model to overcome language priors. Our method can
compensate for the data biases by generating balanced data without introducing
external annotations. Experimental results show that our method can
significantly outperform the state-of-the-art, improving the overall accuracy
from 49.50% to 57.59% on the most commonly used benchmark VQA-CP v2. In other
words, we can increase the performance of annotation-based methods by 16%
without using external annotations.
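The abstract describes two steps: automatically generating labeled question-image pairs to balance the biased data, and a self-supervised auxiliary task that uses those pairs to assist the base VQA model. The sketch below is a minimal PyTorch illustration of one natural way to realize this, assuming the balanced data is produced by pairing each question with a randomly shuffled image from the batch (labeled irrelevant) and that the auxiliary task is binary question-image relevance prediction; the `vqa_model` interface, tensor names, and loss weight are hypothetical and not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def ssl_vqa_losses(vqa_model, images, questions, answer_targets, ssl_weight=1.0):
    """One training-step sketch: VQA loss on original (relevant) pairs plus a
    self-supervised relevance loss on automatically generated balanced pairs.

    `vqa_model(images, questions)` is assumed to return
    (answer_logits, relevance_logit) -- a hypothetical interface.
    """
    # 1) Standard VQA loss on the original, relevant question-image pairs
    #    (answer_targets are soft answer scores, as is common in VQA training).
    answer_logits, rel_logit_pos = vqa_model(images, questions)
    vqa_loss = F.binary_cross_entropy_with_logits(answer_logits, answer_targets)

    # 2) Automatically generate "irrelevant" pairs by shuffling the images
    #    within the batch, so each question is matched with a random image.
    perm = torch.randperm(images.size(0), device=images.device)
    _, rel_logit_neg = vqa_model(images[perm], questions)

    # 3) Self-supervised auxiliary task: predict whether a pair is relevant.
    rel_logits = torch.cat([rel_logit_pos, rel_logit_neg], dim=0)
    rel_labels = torch.cat([torch.ones_like(rel_logit_pos),
                            torch.zeros_like(rel_logit_neg)], dim=0)
    ssl_loss = F.binary_cross_entropy_with_logits(rel_logits, rel_labels)

    # The two losses are combined with a weighting factor (a hyperparameter here).
    return vqa_loss + ssl_weight * ssl_loss
```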
Related papers
- The curse of language biases in remote sensing VQA: the role of spatial
attributes, language diversity, and the need for clear evaluation [32.7348470366509]
The goal of RSVQA is to answer a question formulated in natural language about a remote sensing image.
The problem of language biases is often overlooked in the remote sensing community.
The present work aims at highlighting the problem of language biases in RSVQA with a threefold analysis strategy.
arXiv Detail & Related papers (2023-11-28T13:45:15Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Improving Selective Visual Question Answering by Learning from Your Peers [74.20167944693424]
Visual Question Answering (VQA) models can have difficulties abstaining from answering when they are wrong.
We propose the Learning from Your Peers (LYP) approach for training multimodal selection functions that make abstention decisions.
Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model.
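The summary above says the selection function is trained on targets derived from peer models trained on distinct subsets of the data. Below is a rough sketch of that idea under one plausible reading: each example is scored by the peers that never trained on it, and the selector's target is whether those peers tend to answer correctly. All names, the model interface, and the exact target rule are illustrative assumptions, not the paper's implementation.

```python
import torch

@torch.no_grad()
def peer_abstention_targets(peer_models, peer_subsets, example_subset,
                            images, questions, answer_ids):
    """Build 0/1 abstention targets from 'peer' models trained on other subsets.

    Each example is scored only by peers that never trained on it; if those
    peers tend to answer it correctly, the target is 1 (answer), otherwise 0
    (abstain). Interfaces here are hypothetical placeholders.
    """
    n = images.size(0)
    correct = torch.zeros(n, device=images.device)
    counts = torch.zeros(n, device=images.device)
    for model, subset_id in zip(peer_models, peer_subsets):
        preds = model(images, questions).argmax(dim=-1)   # predicted answer ids
        held_out = example_subset != subset_id            # peer never saw these examples
        correct[held_out] += (preds[held_out] == answer_ids[held_out]).float()
        counts[held_out] += 1
    # Majority of peers correct -> 1 (answer); otherwise 0 (abstain).
    return (correct / counts.clamp(min=1) >= 0.5).float()
```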
arXiv Detail & Related papers (2023-06-14T21:22:01Z)
- SC-ML: Self-supervised Counterfactual Metric Learning for Debiased Visual Question Answering [10.749155815447127]
We propose a self-supervised counterfactual metric learning (SC-ML) method to better focus on question-relevant image features.
SC-ML can adaptively select the question-relevant visual features to answer the question, reducing the negative influence of question-irrelevant visual features on inferring answers.
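The mechanism of adaptively selecting question-relevant visual features can be pictured as question-guided soft attention over region features. The module below is only a generic sketch of that selection step; the counterfactual metric learning objective that is the core of SC-ML is not shown, and all layer names and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class QuestionGuidedSelection(nn.Module):
    """Generic question-guided soft selection over image region features."""
    def __init__(self, q_dim, v_dim, hidden=512):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden)
        self.v_proj = nn.Linear(v_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, q_feat, v_feats):
        # q_feat: (batch, q_dim); v_feats: (batch, regions, v_dim)
        joint = torch.tanh(self.q_proj(q_feat).unsqueeze(1) + self.v_proj(v_feats))
        weights = torch.softmax(self.score(joint), dim=1)   # (batch, regions, 1)
        return (weights * v_feats).sum(dim=1)                # question-relevant summary
```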
arXiv Detail & Related papers (2023-04-04T09:05:11Z)
- Greedy Gradient Ensemble for Robust Visual Question Answering [163.65789778416172]
We stress the language bias in Visual Question Answering (VQA) that comes from two aspects, i.e., distribution bias and shortcut bias.
We propose a new de-bias framework, Greedy Gradient Ensemble (GGE), which combines multiple biased models for unbiased base model learning.
GGE forces the biased models to over-fit the biased data distribution first, thus making the base model pay more attention to examples that are hard to solve with biased models alone.
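GGE is summarized as an ensemble in which biased models absorb the biased distribution first so the base model concentrates on the examples they cannot solve. The snippet below is a much simplified two-member version of that idea (one biased branch, e.g. question-only, combined with the base model at the logit level); it is an assumption-laden sketch, not the greedy fitting procedure of the actual paper.

```python
import torch.nn.functional as F

def ensemble_debias_loss(base_logits, biased_logits, targets):
    """Simplified logit-ensemble debiasing in the spirit of GGE."""
    # The biased branch (e.g. a question-only model) fits the data on its own,
    # absorbing the language prior as quickly as possible.
    bias_loss = F.binary_cross_entropy_with_logits(biased_logits, targets)

    # The base model is trained through the ensemble (sum of logits) with the
    # biased branch detached, so its gradient is dominated by examples the
    # biased branch fails on.
    ensemble_logits = base_logits + biased_logits.detach()
    base_loss = F.binary_cross_entropy_with_logits(ensemble_logits, targets)

    return bias_loss + base_loss
```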
arXiv Detail & Related papers (2021-07-27T08:02:49Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
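ConClaT is described as optimizing a cross-entropy answer loss together with a contrastive loss, either alternately or jointly. A minimal joint-training sketch follows, assuming an InfoNCE-style contrastive term over two augmented views of the joint question-image embedding; the weighting, temperature, and augmentation details are placeholders rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def joint_ce_contrastive_loss(answer_logits, answer_targets,
                              view_a, view_b, temperature=0.1, alpha=1.0):
    """Joint cross-entropy + contrastive objective sketch."""
    # Cross-entropy term on answer prediction.
    ce_loss = F.cross_entropy(answer_logits, answer_targets)

    # InfoNCE-style contrastive term: matching rows of the two augmented views
    # of the same question-image pairs are positives, all other rows negatives.
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive_loss = F.cross_entropy(logits, labels)

    return ce_loss + alpha * contrastive_loss
```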
arXiv Detail & Related papers (2020-10-13T00:23:59Z)
- Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder [12.56413718364189]
We propose a novel model-agnostic question encoder, Visually-Grounded Question (VGQE) for VQA.
VGQE utilizes both visual and language modalities equally while encoding the question.
We demonstrate the effect of VGQE on three recent VQA models and achieve state-of-the-art results.
arXiv Detail & Related papers (2020-07-13T05:36:36Z)
- Visual Grounding Methods for VQA are Working for the Wrong Reasons! [24.84797949716142]
We show that the performance improvements are not a result of improved visual grounding, but a regularization effect.
We propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2.
arXiv Detail & Related papers (2020-04-12T21:45:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality or accuracy of the information presented and is not responsible for any consequences of its use.