Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering
- URL: http://arxiv.org/abs/2406.01402v1
- Date: Mon, 3 Jun 2024 15:04:47 GMT
- Title: Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering
- Authors: Tao Li, Linjun Shou, Xuejun Liu
- Abstract summary: We propose Mixture of Rationales (MoR), a novel multi-modal reasoning method that mixes multiple rationales for zero-shot visual question answering.
MoR achieves a 12.43% accuracy improvement on NLVR2, and a 2.45% accuracy improvement on OKVQA-S.
- Score: 19.351516992903697
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero-shot visual question answering (VQA) is a challenging task that requires reasoning across modalities. Existing methods that rely on a single rationale within the Chain-of-Thought (CoT) framework may fall short of capturing the complexity of the VQA problem, while methods that use multiple rationales may still suffer from low diversity, poor modality alignment, and inefficient retrieval and fusion. To address these challenges, we propose Mixture of Rationales (MoR), a novel multi-modal reasoning method that mixes multiple rationales for VQA. MoR uses a single frozen Vision-and-Language Pre-trained Model (VLPM) to dynamically generate, retrieve and fuse multi-modal thoughts. We evaluate MoR on two challenging VQA datasets, NLVR2 and OKVQA, with two representative backbones, OFA and VL-T5. MoR achieves a 12.43% accuracy improvement on NLVR2 and a 2.45% accuracy improvement on OKVQA-S (the science and technology category of OKVQA).
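The abstract describes a single frozen VLPM that dynamically generates, retrieves, and fuses multi-modal rationales before answering. Below is a minimal Python sketch of what such a generate-retrieve-fuse loop could look like; the helper callables (vlpm, scorer), the prompt templates, and the top-k selection are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, List, Tuple

def mixture_of_rationales(
    image: object,
    question: str,
    vlpm: Callable[[object, str], str],   # frozen vision-language model: (image, prompt) -> text
    scorer: Callable[[str, str], float],  # relevance of a rationale to the question
    prompts: List[str],                   # diverse rationale-eliciting prompt templates
    top_k: int = 3,
) -> str:
    """Hypothetical generate-retrieve-fuse loop for zero-shot VQA (a sketch, not MoR itself)."""
    # 1. Generate: elicit one candidate rationale per prompt from the frozen VLPM.
    rationales = [vlpm(image, f"{p} Question: {question}") for p in prompts]

    # 2. Retrieve: keep the top-k rationales most relevant to the question.
    ranked: List[Tuple[float, str]] = sorted(
        ((scorer(question, r), r) for r in rationales), reverse=True
    )
    selected = [r for _, r in ranked[:top_k]]

    # 3. Fuse: condition the same frozen VLPM on the retained rationales to answer.
    context = " ".join(selected)
    return vlpm(image, f"Rationales: {context} Question: {question} Answer:")
```

In this sketch the same frozen model handles both rationale generation and final answering, consistent with the abstract's claim that a single frozen VLPM is used; the relevance scorer could be as simple as token overlap or an off-the-shelf embedding similarity.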
Related papers
- Exploring Question Decomposition for Zero-Shot VQA [99.32466439254821]
We investigate a question decomposition strategy for visual question answering.
We show that naive application of model-written decompositions can hurt performance.
We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors.
arXiv Detail & Related papers (2023-10-25T23:23:57Z)
- Modularized Zero-shot VQA with Pre-trained Models [20.674979268279728]
We propose a modularized zero-shot network that explicitly decomposes questions into sub-reasoning steps and is highly interpretable.
Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-05-27T05:00:14Z)
- Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out two drawbacks in current RVQA research: (1) datasets contain too many unchallenging UQs, and (2) a large number of annotated UQs are required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
This is complemented with pseudo UQs obtained by randomly pairing images and questions.
arXiv Detail & Related papers (2023-03-09T06:58:29Z)
- Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning [34.6700781893352]
We introduce a virtual benchmark, Super-CLEVR, where different factors in VQA domain shifts can be isolated.
Four factors are considered: visual complexity, question redundancy, concept distribution and concept compositionality.
With controllably generated data, Super-CLEVR enables us to test VQA methods in situations where the test data differs from the training data.
arXiv Detail & Related papers (2022-12-01T03:53:24Z)
- RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering [87.18962441714976]
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA).
We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA remains challenging for them, providing a quantifiable test for building more robust QA methods.
arXiv Detail & Related papers (2022-10-25T21:39:36Z)
- Logically Consistent Loss for Visual Question Answering [66.83963844316561]
Current neural-network-based Visual Question Answering (VQA) models cannot ensure logical consistency, owing to the independent and identically distributed (i.i.d.) assumption.
We propose a new model-agnostic logic constraint to tackle this issue by formulating a logically consistent loss in the multi-task learning framework.
Experiments confirm that the proposed loss formulation and the introduction of hybrid batches lead to more consistency as well as better performance.
arXiv Detail & Related papers (2020-11-19T20:31:05Z)
- Counterfactual Variable Control for Robust and Interpretable Question Answering [57.25261576239862]
Deep neural network based question answering (QA) models are neither robust nor explainable in many cases.
In this paper, we use causal inference to inspect the spurious shortcut correlations that QA models exploit.
We propose a novel approach called Counterfactual Variable Control (CVC) that explicitly mitigates any shortcut correlation.
arXiv Detail & Related papers (2020-10-12T10:09:05Z)
- ManyModalQA: Modality Disambiguation and QA over Diverse Inputs [73.93607719921945]
We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities.
We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
arXiv Detail & Related papers (2020-01-22T14:39:28Z)
- Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models [39.338304913058685]
We study the trade-off between model complexity and performance on the Visual Question Answering task.
We focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline.
arXiv Detail & Related papers (2020-01-20T11:27:21Z)