VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization
- URL: http://arxiv.org/abs/2311.00807v1
- Date: Wed, 1 Nov 2023 19:43:56 GMT
- Title: VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization
- Authors: Suraj Jyothi Unni, Raha Moraffah, Huan Liu
- Abstract summary: Visual question answering (VQA) models are designed to demonstrate visual-textual reasoning capabilities.
Existing domain generalization datasets for VQA exhibit a unilateral focus on textual shifts.
We propose VQA-GEN, the first multi-modal benchmark dataset for distribution shift, generated through a shift-induction pipeline.
- Score: 15.554325659263316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual question answering (VQA) models are designed to demonstrate
visual-textual reasoning capabilities. However, their real-world applicability
is hindered by a lack of comprehensive benchmark datasets. Existing domain
generalization datasets for VQA focus unilaterally on textual shifts, even
though VQA, as a multi-modal task, involves shifts across both the visual and
textual domains. We propose VQA-GEN, the first multi-modal benchmark dataset
for distribution shift, generated through a shift-induction pipeline.
Experiments demonstrate that the VQA-GEN dataset exposes the vulnerability of
existing methods to joint multi-modal distribution shifts, validating that
comprehensive multi-modal shifts are critical for robust VQA generalization.
Models trained on VQA-GEN exhibit improved cross-domain and in-domain
performance, confirming the value of VQA-GEN. Further, we analyze how each
shift technique in our pipeline contributes to the model's generalization.
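To make the shift-induction idea concrete, below is a minimal, purely illustrative sketch, not the actual VQA-GEN pipeline: the function names, the specific image corruptions, and the question templates are all assumptions. It only shows the general shape of applying a visual shift and a textual shift jointly to one (image, question, answer) sample.

```python
# Illustrative sketch only (hypothetical): jointly shifting both modalities of
# a VQA sample while keeping the ground-truth answer fixed.
import random
from PIL import Image, ImageEnhance, ImageFilter


def induce_visual_shift(image: Image.Image) -> Image.Image:
    """Apply one of a few simple image-domain corruptions (assumed examples)."""
    corruption = random.choice(["blur", "brightness", "grayscale"])
    if corruption == "blur":
        return image.filter(ImageFilter.GaussianBlur(radius=2))
    if corruption == "brightness":
        return ImageEnhance.Brightness(image).enhance(1.5)
    return image.convert("L").convert("RGB")


def induce_textual_shift(question: str) -> str:
    """Apply a simple surface-level rephrasing (a stand-in for real text shifts)."""
    templates = [
        "Could you tell me, {q}",
        "In the picture shown, {q}",
        "{q} Please answer briefly.",
    ]
    return random.choice(templates).format(q=question.rstrip("?") + "?")


def induce_joint_shift(sample: dict) -> dict:
    """Produce a distribution-shifted copy of an (image, question, answer) triple."""
    return {
        "image": induce_visual_shift(sample["image"]),
        "question": induce_textual_shift(sample["question"]),
        "answer": sample["answer"],  # the label is unchanged by the shift
    }


if __name__ == "__main__":
    sample = {
        "image": Image.new("RGB", (224, 224), color=(120, 60, 200)),
        "question": "What color is the umbrella?",
        "answer": "purple",
    }
    shifted = induce_joint_shift(sample)
    print(shifted["question"])
```

The point of the sketch is only that both the visual and textual inputs of a sample are perturbed together while the answer stays fixed, which is the kind of joint multi-modal shift the abstract argues existing benchmarks lack.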
Related papers
- Trust but Verify: Programmatic VLM Evaluation in the Wild [62.14071929143684]
Programmatic VLM Evaluation (PROVE) is a new benchmarking paradigm for evaluating VLM responses to open-ended queries.
We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two.
arXiv Detail & Related papers (2024-10-17T01:19:18Z) - Enhanced Textual Feature Extraction for Visual Question Answering: A Simple Convolutional Approach [2.744781070632757]
We compare models that leverage long-range dependencies and simpler models focusing on local textual features within a well-established VQA framework.
We propose ConvGRU, a model that incorporates convolutional layers to improve text feature representation without substantially increasing model complexity.
Tested on the VQA-v2 dataset, ConvGRU demonstrates a modest yet consistent improvement over baselines for question types such as Number and Count.
arXiv Detail & Related papers (2024-05-01T12:39:35Z) - HyperVQ: MLR-based Vector Quantization in Hyperbolic Space [56.4245885674567]
We study the use of hyperbolic spaces for vector quantization (HyperVQ).
We show that HyperVQ performs comparably to VQ in reconstruction and generative tasks, while outperforming it in discriminative tasks and learning a highly disentangled latent space.
arXiv Detail & Related papers (2024-03-18T03:17:08Z) - UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z) - Causal Reasoning through Two Layers of Cognition for Improving Generalization in Visual Question Answering [28.071906755200043]
Generalization in Visual Question Answering (VQA) requires models to answer questions about images with contexts beyond the training distribution.
We propose Cognitive pathways VQA (CopVQA), which improves multimodal predictions by emphasizing causal reasoning factors.
CopVQA achieves a new state-of-the-art (SOTA) on the PathVQA dataset and accuracy comparable to the current SOTA on VQA-CPv2, VQAv2, and VQA-RAD, with one-fourth of the model size.
arXiv Detail & Related papers (2023-10-09T05:07:58Z) - Question Generation for Evaluating Cross-Dataset Shifts in Multi-modal Grounding [7.995360025953931]
Visual question answering (VQA) is the multi-modal task of answering natural language questions about an input image.
We are working on a VQG module that facilitates automatically generating OOD shifts, which aid in systematically evaluating the cross-dataset adaptation capabilities of VQA models.
arXiv Detail & Related papers (2022-01-24T12:42:30Z) - Domain-robust VQA with diverse datasets and methods but no target labels [34.331228652254566]
Domain adaptation for VQA differs from adaptation for object recognition due to additional complexity.
To tackle these challenges, we first quantify domain shifts between popular VQA datasets.
We also construct synthetic shifts in the image and question domains separately.
arXiv Detail & Related papers (2021-03-29T22:24:50Z) - Learning from Lexical Perturbations for Consistent Visual Question
Answering [78.21912474223926]
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations.
We propose a novel modular-network approach that addresses this issue by creating pairs of questions related through linguistic perturbations.
We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline for creating controllable linguistic variations (a minimal illustrative sketch of such perturbations appears after this list).
arXiv Detail & Related papers (2020-11-26T17:38:03Z) - MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering [58.30291671877342]
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z) - Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z) - Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models [39.338304913058685]
We study the trade-off between the model complexity and the performance on the Visual Question Answering task.
We focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline.
arXiv Detail & Related papers (2020-01-20T11:27:21Z)
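Several of the papers above (UNK-VQA, VQA P2) rely on question-level perturbations. The following is a tiny, hypothetical sketch of that idea; the synonym map, function names, and pairing logic are assumptions made for illustration and are not the benchmarks' actual augmentation code.

```python
# Hypothetical sketch: pairing a question with a meaning-preserving rewording,
# in the spirit of the lexical perturbations used by UNK-VQA / VQA P2.

# Assumed mini synonym map used to create the rewording.
SYNONYMS = {
    "picture": "photo",
    "color": "colour",
    "man": "person",
    "kid": "child",
}


def perturb_question(question: str) -> str:
    """Return a meaning-preserving rewording via simple word substitutions."""
    out = []
    for word in question.split():
        core = word.rstrip("?.,")          # separate trailing punctuation
        suffix = word[len(core):]
        out.append(SYNONYMS.get(core.lower(), core) + suffix)
    return " ".join(out)


def make_pair(question: str) -> tuple[str, str]:
    """Pair an original question with its perturbed variant."""
    return question, perturb_question(question)


if __name__ == "__main__":
    original, variant = make_pair("What color is the kid's shirt in the picture?")
    print(original)  # What color is the kid's shirt in the picture?
    print(variant)   # What colour is the kid's shirt in the photo?
```

A consistent VQA model is expected to return the same answer for both questions in such a pair, which is roughly what consistency-oriented benchmarks like VQA P2 evaluate.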