All You May Need for VQA are Image Captions
- URL: http://arxiv.org/abs/2205.01883v1
- Date: Wed, 4 May 2022 04:09:23 GMT
- Title: All You May Need for VQA are Image Captions
- Authors: Soravit Changpinyo, Doron Kukliansky, Idan Szpektor, Xi Chen, Nan Ding, Radu Soricut
- Abstract summary: We propose a method that automatically derives VQA examples at volume.
We show that the resulting data is of high quality.
VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits.
- Score: 24.634567673906666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Question Answering (VQA) has benefited from increasingly sophisticated
models, but has not enjoyed the same level of engagement in terms of data
creation. In this paper, we propose a method that automatically derives VQA
examples at volume, by leveraging the abundance of existing image-caption
annotations combined with neural models for textual question generation. We
show that the resulting data is of high quality. VQA models trained on our data
improve state-of-the-art zero-shot accuracy by double digits and achieve a
level of robustness that is lacking in the same model trained on human-annotated VQA
data.
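As a rough illustration of what deriving VQA examples from an image caption can look like, here is a toy Python sketch. It is not the authors' pipeline: the abstract describes neural models for textual question generation, whereas this sketch swaps in a plain string substitution, and the candidate answer spans are supplied by hand. The names `VQAExample` and `caption_to_vqa` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VQAExample:
    """One image-question-answer triplet (hypothetical container)."""
    image_id: str
    question: str
    answer: str


def caption_to_vqa(image_id: str, caption: str, answer_spans: List[str]) -> List[VQAExample]:
    """Turn a caption into echo-style questions, one per candidate answer.

    Assumption: candidate answers are given as exact substrings of the
    caption. A real system would extract candidates automatically and
    generate questions with a trained model instead of a template.
    """
    examples = []
    for span in answer_spans:
        if span not in caption:
            continue  # skip candidates that do not occur in the caption
        # Replace the first occurrence of the answer with a placeholder word.
        cloze = caption.replace(span, "what", 1).rstrip(".")
        examples.append(VQAExample(image_id, cloze + "?", span))
    return examples


examples = caption_to_vqa("img1", "A dog chases a red ball.", ["red ball", "cat"])
for ex in examples:
    print(ex.image_id, "|", ex.question, "|", ex.answer)
```

Each resulting triplet pairs the caption's image with a question whose answer is known from the caption itself, which is the general idea of reusing existing image-caption annotations as VQA training data.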
Related papers
- Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model [4.41132900194195]
We propose a new method called chain of QA for human-written questions (CoQAH).
CoQAH utilizes a sequence of QA interactions between a large language model and a VQA model trained on synthetic data to reason and derive logical answers for human-written questions.
We tested the effectiveness of CoQAH on two types of human-written VQA datasets for 3D-rendered and chest X-ray images.
arXiv Detail & Related papers (2024-01-12T06:49:49Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Generative Visual Question Answering [0.0]
This paper discusses a viable approach to creating an advanced Visual Question Answering (VQA) model which can produce successful results on temporal generalization.
We propose a new dataset, GenVQA, utilizing images and captions from the VQAv2 and MS-COCO datasets to generate new images through stable diffusion.
Performance evaluation focuses on questions mirroring the original VQAv2 dataset, with the answers having been adjusted to the new images.
arXiv Detail & Related papers (2023-07-18T05:30:23Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
- COIN: Counterfactual Image Generation for VQA Interpretation [5.994412766684842]
We introduce an interpretability approach for VQA models by generating counterfactual images.
In addition to interpreting the result of VQA models on single images, the obtained results and the discussion provide an extensive explanation of VQA models' behaviour.
arXiv Detail & Related papers (2022-01-10T13:51:35Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- Self-Supervised VQA: Answering Visual Questions using Images and Captions [38.05223339919346]
VQA models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets for training.
We study whether models can be trained without any human-annotated Q-A pairs, but only with images and associated text captions.
arXiv Detail & Related papers (2020-12-04T01:22:05Z)
- Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
arXiv Detail & Related papers (2020-10-13T00:23:59Z)
- Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)