Causal Reasoning through Two Layers of Cognition for Improving
Generalization in Visual Question Answering
- URL: http://arxiv.org/abs/2310.05410v1
- Date: Mon, 9 Oct 2023 05:07:58 GMT
- Title: Causal Reasoning through Two Layers of Cognition for Improving
Generalization in Visual Question Answering
- Authors: Trang Nguyen, Naoaki Okazaki
- Abstract summary: Generalization in Visual Question Answering (VQA) requires models to answer questions about images with contexts beyond the training distribution.
We propose Cognitive pathways VQA (CopVQA), which improves multimodal predictions by emphasizing causal reasoning factors.
CopVQA achieves a new state-of-the-art (SOTA) on the PathVQA dataset and accuracy comparable to the current SOTA on VQA-CPv2, VQAv2, and VQA-RAD, with one-fourth of the model size.
- Score: 28.071906755200043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generalization in Visual Question Answering (VQA) requires models to answer
questions about images with contexts beyond the training distribution. Existing
attempts primarily refine unimodal aspects, overlooking enhancements in
multimodal aspects. Besides, diverse interpretations of the input lead to
various modes of answer generation, highlighting the role of causal reasoning
between interpreting and answering steps in VQA. Through this lens, we propose
Cognitive pathways VQA (CopVQA), which improves multimodal predictions by
emphasizing causal reasoning factors. CopVQA first operates a pool of pathways
that capture diverse causal reasoning flows through interpreting and answering
stages. Mirroring human cognition, we decompose the responsibility of each
stage into distinct experts and a cognition-enabled component (CC). The two CCs
strategically execute one expert for each stage at a time. Finally, we
prioritize answer predictions governed by pathways involving both CCs while
disregarding answers produced by either CC, thereby emphasizing causal
reasoning and supporting generalization. Our experiments on real-life and
medical data consistently verify that CopVQA improves VQA performance and
generalization across baselines and domains. Notably, CopVQA achieves a new
state-of-the-art (SOTA) on the PathVQA dataset and accuracy comparable to the
current SOTA on VQA-CPv2, VQAv2, and VQA-RAD, with one-fourth of the model
size.
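The abstract describes a two-stage design: an interpreting stage and an answering stage, each decomposed into a pool of experts plus a cognition-enabled component (CC) that decides which expert handles a given input. The following is a minimal PyTorch sketch of that idea only, assuming soft gating over experts, a single fused image-question feature vector as input, and made-up module names and dimensions; it omits the paper's pathway-prioritization rule for the final prediction and is not the authors' implementation.

```python
# Minimal sketch of the two-stage expert/CC idea described in the abstract.
# Module names, dimensions, and the soft gating are assumptions for
# illustration; this is not the authors' exact CopVQA formulation.
import torch
import torch.nn as nn


class CognitionEnabledStage(nn.Module):
    """A pool of experts plus a cognition-enabled component (CC) that
    decides how much each expert contributes for a given input."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_experts)]
        )
        self.cc = nn.Linear(dim, num_experts)  # gating / expert selection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.cc(x), dim=-1)              # (B, E)
        outputs = torch.stack([e(x) for e in self.experts], 1)   # (B, E, D)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)      # (B, D)


class TwoStagePathways(nn.Module):
    """Interpreting stage followed by an answering stage, each governed by
    its own CC, ending in an answer classifier."""

    def __init__(self, dim: int, num_experts: int, num_answers: int):
        super().__init__()
        self.interpret = CognitionEnabledStage(dim, num_experts)
        self.answer = CognitionEnabledStage(dim, num_experts)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        h = self.interpret(fused_features)   # interpreting stage
        h = self.answer(h)                   # answering stage
        return self.classifier(h)            # answer logits


# Usage with a batch of (already fused) image-question features.
model = TwoStagePathways(dim=512, num_experts=4, num_answers=3129)
logits = model(torch.randn(8, 512))
print(logits.shape)  # torch.Size([8, 3129])
```

At inference, the soft mixture could be replaced with an argmax over the CC's scores so that exactly one expert per stage is executed, closer to the "one expert for each stage at a time" behaviour described in the abstract.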
Related papers
- II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering [15.65067042725113]
We propose II-MMR, a novel idea to identify and improve multi-modal multi-hop reasoning in Visual Question Answering (VQA).
II-MMR takes a VQA question with an image and finds a reasoning path to reach its answer using two novel language promptings.
II-MMR shows its effectiveness across all reasoning cases in both zero-shot and fine-tuning settings.
arXiv Detail & Related papers (2024-02-16T20:14:47Z) - VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization [15.554325659263316]
Visual question answering (VQA) models are designed to demonstrate visual-textual reasoning capabilities.
Existing domain generalization datasets for VQA exhibit a unilateral focus on textual shifts.
We propose VQA-GEN, the first multi-modal benchmark dataset for distribution shift, generated through a shift-induced pipeline.
arXiv Detail & Related papers (2023-11-01T19:43:56Z) - From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities [2.0681376988193843]
The work presents a survey in the domain of Visual Question Answering (VQA) that delves into the intricacies of VQA datasets and methods over the field's history.
We further generalize VQA to multimodal question answering, explore tasks related to VQA, and present a set of open problems for future investigation.
arXiv Detail & Related papers (2023-11-01T05:39:41Z) - Exploring Question Decomposition for Zero-Shot VQA [99.32466439254821]
We investigate a question decomposition strategy for visual question answering.
We show that naive application of model-written decompositions can hurt performance.
We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors.
arXiv Detail & Related papers (2023-10-25T23:23:57Z) - Co-VQA : Answering by Interactive Sub Question Sequence [18.476819557695087]
This paper proposes a conversation-based VQA framework, which consists of three components: Questioner, Oracle, and Answerer.
To perform supervised learning for each model, we introduce a well-designed method to build a sub-question sequence (SQS) for each question on the VQA 2.0 and VQA-CP v2 datasets.
arXiv Detail & Related papers (2022-04-02T15:09:16Z) - Achieving Human Parity on Visual Question Answering [67.22500027651509]
The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image.
This paper describes our recent research on AliceMind-MMU, which obtains similar or even slightly better results than human beings do on VQA.
This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task.
arXiv Detail & Related papers (2021-11-17T04:25:11Z) - Learning from Lexical Perturbations for Consistent Visual Question
Answering [78.21912474223926]
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations.
We propose a novel approach based on modular networks to address this issue, which creates two questions related by linguistic perturbations.
We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
arXiv Detail & Related papers (2020-11-26T17:38:03Z) - Loss re-scaling VQA: Revisiting the LanguagePrior Problem from a
Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
This view explicitly reveals why a VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class-imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
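The class-imbalance reading above suggests a straightforward illustration: re-scale the answer-classification loss so that rare answers count more than frequent ones. The sketch below shows a generic inverse-frequency re-weighting baseline, assuming hard single-label answers and a hypothetical answer_counts vector; it is not the paper's exact re-scaling scheme, which the summary does not specify.

```python
# Generic illustration of loss re-scaling for answer-class imbalance in VQA.
# The weighting scheme (inverse answer frequency) is a common baseline, not
# necessarily the scheme proposed in the paper above.
import torch
import torch.nn.functional as F


def class_rescaled_loss(logits: torch.Tensor,
                        targets: torch.Tensor,
                        answer_counts: torch.Tensor) -> torch.Tensor:
    """Cross-entropy where rare answers receive larger weights than frequent ones.

    logits:        (batch, num_answers) model scores
    targets:       (batch,) ground-truth answer indices
    answer_counts: (num_answers,) answer frequencies in the training set
    """
    weights = 1.0 / answer_counts.clamp(min=1).float()
    weights = weights / weights.sum() * len(weights)  # normalize around 1
    return F.cross_entropy(logits, targets, weight=weights)


# Toy usage: answer 0 is very frequent, answer 2 is rare.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 0, 2, 1])
counts = torch.tensor([1000, 200, 10])
print(class_rescaled_loss(logits, targets, counts))
```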
arXiv Detail & Related papers (2020-10-30T00:57:17Z) - MUTANT: A Training Paradigm for Out-of-Distribution Generalization in
Visual Question Answering [58.30291671877342]
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z) - SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by 5%, marginally improving performance on the Reasoning questions in VQA while also displaying better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.