Generative Visual Question Answering
- URL: http://arxiv.org/abs/2307.10405v1
- Date: Tue, 18 Jul 2023 05:30:23 GMT
- Title: Generative Visual Question Answering
- Authors: Ethan Shen, Scotty Singh, Bhavesh Kumar
- Abstract summary: This paper discusses a viable approach to creating an advanced Visual Question Answering (VQA) model that can produce successful results on temporal generalization.
We propose a new dataset, GenVQA, which uses images and captions from the VQAv2 and MS-COCO datasets to generate new images through Stable Diffusion.
Performance evaluation focuses on questions mirroring the original VQAv2 dataset, with answers adjusted to the new images.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal tasks involving vision and language in deep learning continue to rise in popularity and are leading to the development of newer models that can generalize beyond the extent of their training data. Current models lack temporal generalization, the ability to adapt to changes in future data. This paper discusses a viable approach to creating an advanced Visual Question Answering (VQA) model that can produce successful results on temporal generalization. We propose a new dataset, GenVQA, which uses images and captions from the VQAv2 and MS-COCO datasets to generate new images through Stable Diffusion. This augmented dataset is then used to test a combination of seven baseline and cutting-edge VQA models. Performance evaluation focuses on questions mirroring the original VQAv2 dataset, with answers adjusted to the new images. The paper's purpose is to investigate the robustness of several successful VQA models and assess their performance on future data distributions. Model architectures are analyzed to identify common stylistic choices that improve generalization under temporal distribution shifts. This research highlights the importance of creating a large-scale, future-shifted dataset; such data can enhance the robustness of VQA models, allowing future models to better adapt to temporal distribution shifts.
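Below is a minimal sketch (not code from the paper) of the kind of caption-to-image regeneration the abstract describes, using the Hugging Face diffusers library; the checkpoint name, caption, and file name are illustrative assumptions.

```python
# Minimal sketch: regenerate a "future-shifted" image from an MS-COCO caption
# with Stable Diffusion. The checkpoint id and paths are assumptions, not
# details taken from the paper.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint; the paper does not name one here
    torch_dtype=torch.float16,
).to("cuda")

caption = "a man riding a horse on a beach"  # e.g. an MS-COCO caption
image = pipe(caption, num_inference_steps=50).images[0]
image.save("genvqa_image.png")  # later paired with VQAv2 questions and adjusted answers
```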
Related papers
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
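The summary does not specify the perturbations; purely as an illustrative sketch, perturbations of this kind might look like the following (the blur radius and replacement vocabulary are assumptions):

```python
# Illustrative image/question perturbations in the spirit described above.
# The exact perturbations used by UNK-VQA are not given in this summary.
import random
from PIL import Image, ImageFilter

def perturb_image(img: Image.Image) -> Image.Image:
    # Degrade visual evidence so the question may become unanswerable.
    return img.filter(ImageFilter.GaussianBlur(radius=4))

def perturb_question(question: str, vocab=("dog", "car", "tree")) -> str:
    # Replace one word so the question may no longer match the image.
    words = question.split()
    i = random.randrange(len(words))
    words[i] = random.choice(vocab)
    return " ".join(words)
```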
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization [27.437077941786768]
Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks.
We evaluate two pretrained V&L models under different settings by conducting cross-dataset evaluations.
We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task.
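A cross-dataset evaluation of this kind can be pictured as a train/test grid; the sketch below is illustrative only, with `finetune` and `accuracy` as assumed helper functions.

```python
# Sketch of a cross-dataset protocol: fine-tune on one benchmark, evaluate on
# all benchmarks. Off-diagonal cells measure out-of-distribution generalization.
def cross_dataset_grid(model_factory, benchmarks, finetune, accuracy):
    grid = {}
    for train_name, train_set in benchmarks.items():
        model = finetune(model_factory(), train_set)
        for test_name, test_set in benchmarks.items():
            grid[(train_name, test_name)] = accuracy(model, test_set)
    return grid
```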
arXiv Detail & Related papers (2022-05-24T16:44:45Z)
- All You May Need for VQA are Image Captions [24.634567673906666]
We propose a method that automatically derives VQA examples at volume.
We show that the resulting data is of high quality.
VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits.
arXiv Detail & Related papers (2022-05-04T04:09:23Z)
- SimVQA: Exploring Simulated Environments for Visual Question Answering [15.030013924109118]
We explore using synthetic computer-generated data to fully control the visual and language space.
We quantify the effect of synthetic data on real-world VQA benchmarks and the extent to which it produces results that generalize to real data.
We propose Feature Swapping (F-SWAP) -- where we randomly switch object-level features during training to make a VQA model more domain invariant.
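A minimal sketch of the F-SWAP idea as stated above (tensor shapes and the swap rate are assumptions, not the paper's exact procedure):

```python
# Sketch of Feature Swapping (F-SWAP): randomly exchange object-level features
# between images in a batch to discourage domain-specific shortcuts.
import torch

def f_swap(obj_feats: torch.Tensor, swap_prob: float = 0.1) -> torch.Tensor:
    """obj_feats: [batch, num_objects, dim] region features from a detector."""
    b, n, _ = obj_feats.shape
    swapped = obj_feats.clone()
    mask = torch.rand(b, n) < swap_prob   # which object slots to swap
    perm = torch.randperm(b)              # donor image for each sample
    swapped[mask] = obj_feats[perm][mask] # pull features from the donor images
    return swapped
```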
arXiv Detail & Related papers (2022-03-31T17:44:27Z)
- X-GGM: Graph Generative Modeling for Out-of-Distribution Generalization in Visual Question Answering [49.36818290978525]
Recomposing existing visual concepts can generate compositions that are unseen in the training set.
We propose a graph generative modeling-based training scheme (X-GGM) to handle the problem implicitly.
The baseline VQA model trained with the X-GGM scheme achieves state-of-the-art OOD performance on two standard VQA OOD benchmarks.
arXiv Detail & Related papers (2021-07-24T10:17:48Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span-selection task format, used for QA datasets like QAMR and SQuAD 2.0, is effective in differentiating between strong and weak models.
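For reference, IRT analyses of this kind typically rest on a logistic response model such as the standard two-parameter (2PL) form sketched below; the paper's exact parameterization may differ.

```python
# Two-parameter logistic (2PL) IRT response model: the probability that model i
# answers example j correctly, given the model's ability theta_i and the
# example's discrimination a_j and difficulty b_j.
import math

def p_correct(theta_i: float, a_j: float, b_j: float) -> float:
    return 1.0 / (1.0 + math.exp(-a_j * (theta_i - b_j)))
```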
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models [45.777326168922635]
We introduce Adversarial VQA, a new large-scale VQA benchmark, collected iteratively via an adversarial human-and-model-in-the-loop procedure.
We find that non-expert annotators can successfully attack SOTA VQA models with relative ease.
Both large-scale pre-trained models and adversarial training methods achieve far lower performance on this benchmark than they do on the standard VQA v2 dataset.
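The collection procedure can be pictured as the following loop (a sketch under assumed helper functions, not the authors' exact protocol):

```python
# Sketch of human-and-model-in-the-loop collection: keep a question only when
# the current model answers it incorrectly. `annotator` and `model` are assumed.
def collect_adversarial_vqa(images, annotator, model, rounds=3):
    benchmark = []
    for image in images:
        for _ in range(rounds):
            question, gold = annotator(image)   # human writes a tricky question
            if model(image, question) != gold:  # model fails -> adversarial example
                benchmark.append((image, question, gold))
                break
    return benchmark
```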
arXiv Detail & Related papers (2021-06-01T05:54:41Z)
- Continual Learning for Blind Image Quality Assessment [80.55119990128419]
Blind image quality assessment (BIQA) models fail to continually adapt to subpopulation shift.
Recent work suggests training BIQA methods on the combination of all available human-rated IQA datasets.
We formulate continual learning for BIQA, where a model learns continually from a stream of IQA datasets.
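The naive baseline such continual-learning work improves on is plain sequential fine-tuning over the dataset stream, roughly as sketched below (`train_one_dataset` and `evaluate` are assumed helpers, not the paper's method):

```python
# Naive sequential fine-tuning over a stream of IQA datasets, re-evaluating on
# everything seen so far to expose catastrophic forgetting.
def continual_biqa(model, dataset_stream, train_one_dataset, evaluate):
    history = []
    for dataset in dataset_stream:
        train_one_dataset(model, dataset)  # adapt to the new subpopulation
        history.append([evaluate(model, seen)
                        for seen in dataset_stream[: len(history) + 1]])
    return history
```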
arXiv Detail & Related papers (2021-02-19T03:07:01Z)
- Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
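A minimal sketch of the template idea (the template and the answer-span selection are illustrative assumptions, not the paper's templates):

```python
# Turn a retrieved sentence into a cloze-style question/answer pair by
# replacing the answer span with a wh-word under one fixed template.
def sentence_to_qa(sentence: str, answer_span: str, wh_word: str = "what") -> tuple[str, str]:
    cloze = sentence.replace(answer_span, wh_word)  # e.g. "Curie discovered what in 1898"
    question = cloze.rstrip(".") + "?"
    return question, answer_span
```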
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, a gain of 6.5%.
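The question-side half of CSS can be sketched as follows; the attribution step that identifies the critical words is assumed rather than shown, and the mask token is an illustrative choice.

```python
# Sketch of CSS-style counterfactual synthesis on the question side: mask the
# critical words so the model cannot rely on language shortcuts. The critical
# indices would come from an attribution method over the trained model.
MASK = "[MASK]"

def mask_critical_words(question: str, critical_indices: list[int]) -> str:
    words = question.split()
    for i in critical_indices:
        words[i] = MASK
    return " ".join(words)

# e.g. mask_critical_words("what color is the umbrella", [1, 4])
# -> "what [MASK] is the [MASK]"
```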
arXiv Detail & Related papers (2020-03-14T08:34:31Z)