Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning
- URL: http://arxiv.org/abs/2212.00259v2
- Date: Thu, 1 Jun 2023 03:57:12 GMT
- Title: Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning
- Authors: Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, Alan Yuille
- Abstract summary: We introduce a virtual benchmark, Super-CLEVR, where different factors in VQA domain shifts can be isolated.
Four factors are considered: visual complexity, question redundancy, concept distribution and concept compositionality.
With controllably generated data, Super-CLEVR enables us to test VQA methods in situations where the test data differs from the training data.
- Score: 34.6700781893352
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Question Answering (VQA) models often perform poorly on out-of-distribution data and struggle with domain generalization. Due to the multi-modal nature of this task, multiple factors of variation are intertwined, making generalization difficult to analyze. This motivates us to introduce a virtual benchmark, Super-CLEVR, where different factors in VQA domain shifts can be isolated so that their effects can be studied independently. Four factors are considered: visual complexity, question redundancy, concept distribution, and concept compositionality. With controllably generated data, Super-CLEVR enables us to test VQA methods in situations where the test data differs from the training data along each of these axes. We study four existing methods, including two neural-symbolic methods (NSCL and NSVQA) and two non-symbolic methods (FiLM and mDETR), as well as our proposed method, probabilistic NSVQA (P-NSVQA), which extends NSVQA with uncertainty reasoning. P-NSVQA outperforms the other methods on three of the four domain shift factors. Our results suggest that disentangling reasoning from perception, combined with probabilistic uncertainty, forms a strong VQA model that is more robust to domain shifts. The dataset and code are released at https://github.com/Lizw14/Super-CLEVR.
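To make the last claim concrete, here is a minimal sketch of the kind of uncertainty-aware symbolic execution that the description of P-NSVQA suggests: the executor keeps the detector's attribute distributions instead of committing to hard labels, and marginalizes over them when answering. The Object class, soft_filter, expected_count, and the toy probabilities are illustrative assumptions, not the released Super-CLEVR code.

    # Illustrative sketch: probabilistic execution of one symbolic VQA step.
    # A hard symbolic executor would commit to each object's argmax color;
    # here every object keeps a distribution and the count marginalizes it.
    from dataclasses import dataclass

    @dataclass
    class Object:
        color_probs: dict  # e.g. detector output {"red": 0.9, "blue": 0.1}

    def soft_filter(objects, value):
        """Per-object probability of matching the queried attribute value."""
        return [obj.color_probs.get(value, 0.0) for obj in objects]

    def expected_count(match_probs):
        """Expected number of matches under the detector's uncertainty."""
        return sum(match_probs)

    scene = [
        Object({"red": 0.9, "blue": 0.1}),
        Object({"red": 0.4, "blue": 0.6}),
    ]
    # "How many red objects are there?" -> expected count 1.3, answer 1
    print(round(expected_count(soft_filter(scene, "red"))))

The payoff over hard execution shows when the detector is unsure: an object that is 0.4 "red" shifts the count by 0.4 rather than flipping it by a whole unit.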
Related papers
- Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering [19.351516992903697]
We propose Mixture of Rationales (MoR), a novel multi-modal reasoning method that mixes multiple rationales for zero-shot visual question answering.
MoR achieves a 12.43% accuracy improvement on NLVR2, and a 2.45% accuracy improvement on OKVQA-S.
arXiv Detail & Related papers (2024-06-03T15:04:47Z)
- HyperVQ: MLR-based Vector Quantization in Hyperbolic Space [56.4245885674567]
We study the use of hyperbolic spaces for vector quantization (HyperVQ).
We show that HyperVQ performs comparably in reconstruction and generative tasks while outperforming VQ in discriminative tasks and learning a highly disentangled latent space.
arXiv Detail & Related papers (2024-03-18T03:17:08Z)
- VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization [15.554325659263316]
Visual question answering (VQA) models are designed to demonstrate visual-textual reasoning capabilities.
Existing domain generalization datasets for VQA exhibit a unilateral focus on textual shifts.
We propose VQA-GEN, the first multi-modal benchmark dataset for distribution shift, generated through a shift-induced pipeline.
arXiv Detail & Related papers (2023-11-01T19:43:56Z)
- Exploring Question Decomposition for Zero-Shot VQA [99.32466439254821]
We investigate a question decomposition strategy for visual question answering.
We show that naive application of model-written decompositions can hurt performance.
We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors (see the first sketch after this list).
arXiv Detail & Related papers (2023-10-25T23:23:57Z)
- Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering [58.64831511644917]
We introduce an interpretable-by-design model that factors model decisions into intermediate human-legible explanations.
We show that our inherently interpretable system can improve by 4.64% over a comparable black-box system on reasoning-focused questions.
arXiv Detail & Related papers (2023-05-24T08:33:15Z)
- From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data [27.160303686163164]
Visual question answering (VQA) for remote sensing scenes has great potential in intelligent human-computer interaction systems.
No object annotations are available in RSVQA datasets, which makes it difficult for models to exploit informative region representations.
For each image in the RSVQA task, there are questions with clearly different difficulty levels.
A multi-level visual feature learning method is proposed to jointly extract language-guided holistic and regional image features.
arXiv Detail & Related papers (2022-05-06T11:37:00Z)
- Domain-robust VQA with diverse datasets and methods but no target labels [34.331228652254566]
Domain adaptation for VQA differs from adaptation for object recognition due to additional complexity.
To tackle these challenges, we first quantify domain shifts between popular VQA datasets.
We also construct synthetic shifts in the image and question domains separately.
arXiv Detail & Related papers (2021-03-29T22:24:50Z)
- Counterfactual Variable Control for Robust and Interpretable Question Answering [57.25261576239862]
Deep neural network-based question answering (QA) models are neither robust nor explainable in many cases.
In this paper, we inspect this spurious "capability" of QA models using causal inference.
We propose a novel approach called Counterfactual Variable Control (CVC) that explicitly mitigates any shortcut correlation.
arXiv Detail & Related papers (2020-10-12T10:09:05Z)
- Regularizing Attention Networks for Anomaly Detection in Visual Question Answering [10.971443035470488]
We evaluate the robustness of state-of-the-art VQA models to five different anomalies.
We propose an attention-based method that uses the confidence of reasoning between input images and questions.
We show that maximum entropy regularization of attention networks can significantly improve attention-based anomaly detection (see the second sketch after this list).
arXiv Detail & Related papers (2020-09-21T17:47:49Z)
- Robust Question Answering Through Sub-part Alignment [53.94003466761305]
We model question answering as an alignment problem.
We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets.
arXiv Detail & Related papers (2020-04-30T09:10:57Z)
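Two of the entries above name mechanisms concrete enough to sketch. First, selective question decomposition (from "Exploring Question Decomposition for Zero-Shot VQA"): decompose a question only when the direct answer looks unreliable, since naive decomposition of every question can hurt performance. The vqa_model interface and the 0.5 threshold below are hypothetical stand-ins, not the paper's actual API.

    # Illustrative sketch: decomposition guarded by model confidence.
    CONF_THRESHOLD = 0.5  # hypothetical cutoff; tuned per model in practice

    def answer_selectively(vqa_model, image, question):
        answer, confidence = vqa_model.answer(image, question)
        if confidence >= CONF_THRESHOLD:
            return answer  # trust the direct prediction
        # Second-guess: break the question down, answer the pieces,
        # and re-answer the original question with them as context.
        sub_questions = vqa_model.decompose(question)
        sub_answers = [(q, vqa_model.answer(image, q)[0]) for q in sub_questions]
        final_answer, _ = vqa_model.answer(image, question, context=sub_answers)
        return final_answer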
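Second, maximum entropy regularization of attention (from "Regularizing Attention Networks for Anomaly Detection in Visual Question Answering"): add a training term that discourages overconfident, low-entropy attention maps. The loss weighting, tensor shapes, and lam value below are illustrative assumptions, not the paper's exact formulation.

    # Illustrative sketch: penalizing low-entropy attention distributions.
    import torch
    import torch.nn.functional as F

    def attention_entropy(attn_logits):
        """Shannon entropy of the attention distribution, per example."""
        p = F.softmax(attn_logits, dim=-1)
        log_p = F.log_softmax(attn_logits, dim=-1)
        return -(p * log_p).sum(dim=-1)  # shape: (batch,)

    def regularized_loss(task_loss, attn_logits, lam=0.1):
        # Subtracting mean entropy pushes training toward higher-entropy
        # (less overconfident) attention, i.e. maximum entropy regularization.
        return task_loss - lam * attention_entropy(attn_logits).mean()

    # Usage: logits over 10 image regions for a batch of 4 questions.
    attn_logits = torch.randn(4, 10)
    task_loss = torch.tensor(1.0)  # stand-in for the VQA answer loss
    print(regularized_loss(task_loss, attn_logits))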
This list is automatically generated from the titles and abstracts of the papers on this site.