Self-Supervised VQA: Answering Visual Questions using Images and
Captions
- URL: http://arxiv.org/abs/2012.02356v1
- Date: Fri, 4 Dec 2020 01:22:05 GMT
- Title: Self-Supervised VQA: Answering Visual Questions using Images and
Captions
- Authors: Pratyay Banerjee, Tejas Gokhale, Yezhou Yang, Chitta Baral
- Abstract summary: VQA models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets for training.
We study whether models can be trained without any human-annotated Q-A pairs, but only with images and associated text captions.
- Score: 38.05223339919346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Methodologies for training VQA models assume the availability of datasets
with human-annotated Image-Question-Answer (I-Q-A) triplets for training. This
has led to a heavy reliance and overfitting on datasets and a lack of
generalization to new types of questions and scenes. Moreover, these datasets
exhibit annotator subjectivity, biases, and errors, along with linguistic
priors, which percolate into VQA models trained on such samples. We study
whether models can be trained without any human-annotated Q-A pairs, but only
with images and associated text captions which are descriptive and less
subjective. We present a method to train models with procedurally generated Q-A
pairs from captions using techniques such as templates and annotation
frameworks like QASRL. As most VQA models rely on dense and costly object
annotations extracted from object detectors, we propose spatial-pyramid image
patches as a simple but effective alternative to object bounding boxes, and
demonstrate that our method uses fewer human annotations. We benchmark on
VQA-v2, GQA, and on VQA-CP, which contains a softer version of label shift. Our
methods surpass prior supervised methods on VQA-CP and are competitive with
methods without object features in the fully supervised setting.
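To make the two ingredients above concrete, here is a minimal sketch of (a) toy template-based Q-A generation from a caption and (b) spatial-pyramid patch extraction as a stand-in for detector bounding boxes. This is an illustration under assumed details, not the paper's implementation: the actual method relies on QASRL frames and richer templates, and the pyramid levels (1, 2, 4), the function names, and the toy "-ing" pattern below are hypothetical.

```python
import itertools

import numpy as np
from PIL import Image


def caption_to_qa(caption):
    """Toy template-based Q-A generation from a caption.

    The paper procedurally generates Q-A pairs with templates and
    QASRL-style annotation frames; this simplified stand-in only
    handles captions shaped like "<subject> <verb>-ing <object...>".
    """
    qa_pairs = []
    tokens = caption.lower().rstrip(".").split()
    for i, tok in enumerate(tokens):
        # e.g. "a man riding a horse" -> ("what is a man riding?", "a horse")
        if tok.endswith("ing") and 1 <= i < len(tokens) - 1:
            subject = " ".join(tokens[:i])
            answer = " ".join(tokens[i + 1:])
            qa_pairs.append((f"what is {subject} {tok}?", answer))
    return qa_pairs


def spatial_pyramid_patches(image, levels=(1, 2, 4)):
    """Cut an image into an L x L grid of crops for each pyramid level.

    Levels (1, 2, 4) give 1 + 4 + 16 = 21 patches; these region crops
    stand in for object-detector bounding boxes as visual inputs.
    """
    w, h = image.size
    patches = []
    for level in levels:
        pw, ph = w // level, h // level
        for row, col in itertools.product(range(level), repeat=2):
            box = (col * pw, row * ph, (col + 1) * pw, (row + 1) * ph)
            patches.append(np.asarray(image.crop(box)))
    return patches


if __name__ == "__main__":
    print(caption_to_qa("A man riding a horse on the beach"))
    print(len(spatial_pyramid_patches(Image.new("RGB", (224, 224)))))  # 21
```

Because the patch grid is computed directly from image geometry, no detector training data or box annotations are required, which is the point of the spatial-pyramid alternative.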
Related papers
- Improved Few-Shot Image Classification Through Multiple-Choice Questions [1.4432605069307167]
We propose a simple method to boost VQA performance for image classification using only a handful of labeled examples and a multiple-choice question.
We demonstrate that this method outperforms both pure visual encoders and zero-shot VQA baselines on common few-shot tasks.
arXiv Detail & Related papers (2024-07-23T03:09:42Z)
- Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model [4.41132900194195]
We propose a new method called chain of QA for human-written questions (CoQAH).
CoQAH utilizes a sequence of QA interactions between a large language model and a VQA model trained on synthetic data to reason and derive logical answers for human-written questions.
We tested the effectiveness of CoQAH on two types of human-written VQA datasets for 3D-rendered and chest X-ray images.
arXiv Detail & Related papers (2024-01-12T06:49:49Z)
- All You May Need for VQA are Image Captions [24.634567673906666]
We propose a method that automatically derives VQA examples at volume.
We show that the resulting data is of high quality.
VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits.
arXiv Detail & Related papers (2022-05-04T04:09:23Z)
- Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering [59.20766562530209]
VQA models still tend to capture superficial linguistic correlations in the training set.
Recent VQA works introduce an auxiliary question-only model to regularize the training of targeted VQA models.
We propose a novel model-agnostic Counterfactual Samples Synthesizing and Training (CSST) strategy.
arXiv Detail & Related papers (2021-10-03T14:31:46Z)
- Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering [18.33311267792116]
We find that many of the "unknowns" to the learned VQA model are indeed "known" in the dataset implicitly.
We present a simple data augmentation pipeline SimpleAug to turn this "known" knowledge into training examples for VQA.
arXiv Detail & Related papers (2021-09-13T16:56:43Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- Learning Compositional Representation for Few-shot Visual Question Answering [93.4061107793983]
Current Visual Question Answering methods perform well on answers with ample training data but have limited accuracy on novel answers with few examples.
We propose to extract attributes from answers that have enough data and compose them to constrain the learning of the few-shot answers.
Experimental results on the VQA v2.0 validation dataset demonstrate the effectiveness of our proposed attribute network.
arXiv Detail & Related papers (2021-02-21T10:16:24Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
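The counterfactual-synthesis idea summarized in the CSS and CSST entries above can be sketched in a few lines: mask the most influential question words or image regions to create counterfactual training samples. The sketch below is a toy illustration under assumed inputs (precomputed word and region importance scores, a generic "[MASK]" token, hypothetical function names), not the papers' exact procedure.

```python
import numpy as np


def synthesize_counterfactuals(question_tokens, region_feats,
                               word_scores, region_scores, top_k=1):
    """Toy counterfactual sample synthesis in the spirit of CSS/CSST.

    Masks the top-k most influential question words (Q-side) and zeroes
    the top-k most influential image-region features (V-side). Importance
    scores are assumed to come from some attribution method; the papers'
    own scoring and answer reassignment steps are omitted here.
    """
    # Q-side counterfactual: blank out the most important words.
    q_idx = set(np.argsort(word_scores)[-top_k:])
    cf_question = ["[MASK]" if i in q_idx else tok
                   for i, tok in enumerate(question_tokens)]

    # V-side counterfactual: zero out the most important region features.
    v_idx = np.argsort(region_scores)[-top_k:]
    cf_regions = region_feats.copy()
    cf_regions[v_idx] = 0.0
    return cf_question, cf_regions


if __name__ == "__main__":
    q = "what color is the horse".split()
    feats = np.random.rand(36, 2048)      # 36 region feature vectors
    cf_q, cf_v = synthesize_counterfactuals(
        q, feats, np.random.rand(len(q)), np.random.rand(36))
    print(cf_q, cf_v.shape)
```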