Towards a Unified Model for Generating Answers and Explanations in
Visual Question Answering
- URL: http://arxiv.org/abs/2301.10799v1
- Date: Wed, 25 Jan 2023 19:29:19 GMT
- Authors: Chenxi Whitehouse, Tillman Weyde, Pranava Madhyastha
- Abstract summary: We argue that training explanation models independently of the QA model makes the explanations less grounded and limits performance.
We propose a multitask learning approach towards a Unified Model for more grounded and consistent generation of both Answers and Explanations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Providing explanations for visual question answering (VQA) has gained much
attention in research. However, most existing systems use separate models for
predicting answers and providing explanations. We argue that training
explanation models independently of the QA model makes the explanations less
grounded and limits performance. To address this, we propose a multitask
learning approach towards a Unified Model for more grounded and consistent
generation of both Answers and Explanations (UMAE). To achieve this, we add
artificial prompt tokens to training instances and finetune a multimodal
encoder-decoder model on various VQA tasks. In our experiments, UMAE models
surpass the prior SOTA answer accuracy on A-OKVQA by 10-15%, show competitive
results on OK-VQA, achieve new SOTA explanation scores on A-OKVQA and VCR, and
demonstrate promising out-of-domain performance on VQA-X.
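The abstract describes adding artificial prompt tokens to training instances so one encoder-decoder model can be multitask-finetuned for both answers and explanations. A minimal sketch of that instance-construction idea follows; the token names (`<answer>`, `<explain>`), field names, and the example question are illustrative assumptions, not details from the paper.

```python
def build_instance(task_token: str, question: str, target: str) -> dict:
    """Prepend an artificial prompt token so a single multimodal
    encoder-decoder model can be finetuned on several tasks at once.
    The token tells the model which output (answer vs. explanation)
    to generate for the same underlying image-question pair."""
    return {"input": f"{task_token} question: {question}", "target": target}

# One image-question pair yields two training instances, distinguished
# only by the prompt token prepended to the input sequence.
answer_inst = build_instance("<answer>", "What is the dog doing?",
                             "fetching a ball")
expl_inst = build_instance("<explain>", "What is the dog doing?",
                           "the dog is running with a ball in its mouth")
```

In a real setup these instances would be tokenized and fed to a multimodal encoder-decoder (together with image features) during finetuning; the sketch only shows how the prompt tokens route the shared model between tasks.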
Related papers
- Retrieval-Augmented Natural Language Reasoning for Explainable Visual Question Answering [2.98667511228225]
ReRe is an encoder-decoder architecture model that uses a pre-trained CLIP vision encoder and a pre-trained GPT-2 language model as the decoder.
ReRe outperforms previous methods in VQA accuracy and explanation score, and produces more persuasive and reliable natural language explanations (NLE).
arXiv Detail & Related papers (2024-08-30T04:39:43Z) - Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion [6.9879884952138065]
The Rank VQA model integrates high-quality visual features extracted using the Faster R-CNN model and rich semantic text features obtained from a pre-trained BERT model.
A ranking learning module is incorporated to optimize the relative ranking of answers, thus improving answer accuracy.
Our model significantly outperforms existing state-of-the-art models on standard VQA datasets.
arXiv Detail & Related papers (2024-08-14T05:18:43Z) - Towards a performance analysis on pre-trained Visual Question Answering
models for autonomous driving [2.9552300389898094]
This paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT.
The performance of these models is evaluated by comparing the similarity of responses to reference answers provided by computer vision experts.
arXiv Detail & Related papers (2023-07-18T15:11:40Z) - An Empirical Comparison of LM-based Question and Answer Generation
Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z) - COIN: Counterfactual Image Generation for VQA Interpretation [5.994412766684842]
We introduce an interpretability approach for VQA models by generating counterfactual images.
In addition to interpreting the results of VQA models on single images, the obtained results and the accompanying discussion provide an extensive explanation of VQA models' behaviour.
arXiv Detail & Related papers (2022-01-10T13:51:35Z) - NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z) - MUTANT: A Training Paradigm for Out-of-Distribution Generalization in
Visual Question Answering [58.30291671877342]
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z) - Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA).
First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA).
Second, we take advantage of the QA model to extract more appropriate answers, iteratively refining the data over RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z) - Template-Based Question Generation from Retrieved Sentences for Improved
Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z) - Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
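The CSS entry above synthesizes counterfactual training samples by masking critical objects in images or critical words in questions. A minimal sketch of the question-side masking is below; the function name, the mask token, and the choice of which words count as "critical" are assumptions for illustration, not the paper's actual implementation (which identifies critical objects and words from model attributions).

```python
def synthesize_counterfactual_question(question: str,
                                       critical_words: set,
                                       mask_token: str = "[MASK]") -> str:
    """Build a counterfactual question by masking the critical words,
    so the original ground-truth answer no longer applies and the model
    cannot rely on shallow language correlations."""
    masked = []
    for token in question.split():
        # Compare without trailing punctuation so "umbrella?" matches.
        if token.strip("?.,!").lower() in critical_words:
            masked.append(mask_token)
        else:
            masked.append(token)
    return " ".join(masked)

q = "What color is the umbrella?"
print(synthesize_counterfactual_question(q, {"color", "umbrella"}))
# -> What [MASK] is the [MASK]
```

Pairing each original sample with such a counterfactual (and a changed supervision signal) is what pushes the model toward the visually grounded evidence rather than dataset priors.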
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.