Convincing Rationales for Visual Question Answering Reasoning
- URL: http://arxiv.org/abs/2402.03896v1
- Date: Tue, 6 Feb 2024 11:07:05 GMT
- Title: Convincing Rationales for Visual Question Answering Reasoning
- Authors: Kun Li, George Vosselman, Michael Ying Yang
- Abstract summary: Visual Question Answering (VQA) is a challenging task of predicting the answer to a question about the content of an image.
To generate both visual and textual rationales alongside the predicted answer for a given image/question pair, we propose Convincing Rationales for VQA (CRVQA).
CRVQA achieves competitive performance on generic VQA datasets in the zero-shot evaluation setting.
- Score: 14.490692389105947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) is the challenging task of predicting the
answer to a question about the content of an image. It requires a deep
understanding of both the textual question and the visual image. Prior works
evaluate answering models directly by simply calculating the accuracy of the
predicted answers. However, the inner reasoning behind the prediction is
disregarded in such a "black box" system, and we do not even know whether the
predictions can be trusted. In some cases, models still produce the correct answers
even when they focus on irrelevant visual regions or textual tokens, which
makes them unreliable and illogical. To generate both visual and textual
rationales alongside the predicted answer for a given image/question pair, we
propose Convincing Rationales for VQA (CRVQA). Because the new outputs require
extra annotations, CRVQA is trained and evaluated on samples converted from
existing VQA datasets and their visual labels. Extensive experiments
demonstrate that the visual and textual rationales support the prediction of
the answers and further improve the accuracy. Furthermore, CRVQA achieves
competitive performance on generic VQA datasets in the zero-shot evaluation
setting. The dataset and source code will be released at
https://github.com/lik1996/CRVQA2024.
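For context on the "black box" evaluation the abstract criticizes, the sketch below shows the commonly used VQA-v2 style accuracy metric, which scores only the predicted answer and gives no credit for the reasoning behind it. The `predictions`/`annotations` layout is an illustrative assumption, not the data format used by CRVQA.

```python
# A minimal sketch of answer-only VQA evaluation ("simply calculating the
# accuracy of the predicted answers"). The commonly used VQA-v2 style metric
# credits a prediction by how many human annotators gave the same answer,
# capped at 1.0. The data layout here is an assumption for illustration only.
from typing import Dict, List


def vqa_accuracy(predicted: str, human_answers: List[str]) -> float:
    # min(#annotators who gave this answer / 3, 1): three agreeing humans
    # already count the prediction as fully correct.
    matches = sum(a.strip().lower() == predicted.strip().lower() for a in human_answers)
    return min(matches / 3.0, 1.0)


def evaluate(predictions: Dict[str, str], annotations: Dict[str, List[str]]) -> float:
    # Mean accuracy over all questions (keyed by question id) that were answered.
    scores = [vqa_accuracy(predictions[qid], answers)
              for qid, answers in annotations.items() if qid in predictions]
    return sum(scores) / max(len(scores), 1)
```

A rationale-producing system such as CRVQA adds visual and textual explanations on top of this answer-only score, so a prediction can be checked against the evidence the model actually attended to.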
Related papers
- Fully Authentic Visual Question Answering Dataset from Online Communities [72.0524198499719]
Visual Question Answering (VQA) entails answering questions about images.
We introduce the first VQA dataset in which all contents originate from an authentic use case.
We characterize this dataset and how it relates to eight mainstream VQA datasets.
arXiv Detail & Related papers (2023-11-27T06:19:00Z)
- Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it comprises graph construction, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
- Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image.
The models trained on this dataset predict biased answers due to the lack of understanding of visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z)
- Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering [58.64831511644917]
We introduce an interpretable-by-design model that factors model decisions into intermediate human-legible explanations.
We show that our inherently interpretable system improves by 4.64% over a comparable black-box system on reasoning-focused questions.
arXiv Detail & Related papers (2023-05-24T08:33:15Z)
- Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out two drawbacks of current RVQA research: (1) datasets contain too many unchallenging UQs, and (2) a large number of annotated UQs are required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
This combines pseudo UQs obtained by randomly pairing images and questions, with an
arXiv Detail & Related papers (2023-03-09T06:58:29Z)
- Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim to answer a question using its relevant paragraph and the question-answer pairs from earlier turns of the conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our models, Answer Selection-based realistic Conversation Question Answering, on two standard ConvQA datasets.
arXiv Detail & Related papers (2023-02-10T09:42:07Z)
- Grounding Answers for Visual Questions Asked by Visually Impaired People [16.978747012406266]
VizWiz-VQA-Grounding is the first dataset that visually grounds answers to visual questions asked by people with visual impairments.
We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different.
arXiv Detail & Related papers (2022-02-04T06:47:16Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- IQ-VQA: Intelligent Visual Question Answering [3.09911862091928]
We show that our framework improves consistency of VQA models by 15% on the rule-based dataset.
We also quantitatively show improvement in attention maps which highlights better multi-modal understanding of vision and language.
arXiv Detail & Related papers (2020-07-08T20:41:52Z)
- Generating Rationales in Visual Question Answering [28.45552957339557]
We propose a new task of rationale generation for Visual Question Answering (VQA).
We use data from the Visual Commonsense Reasoning (VCR) task, as it contains ground truths along with visual questions and answers.
We train ViLBERT with GPT-2 in an end-to-end fashion with the dual task of predicting the answer in VQA and generating rationales.
arXiv Detail & Related papers (2020-04-04T22:15:35Z)
- In Defense of Grid Features for Visual Question Answering [65.71985794097426]
We revisit grid features for visual question answering (VQA) and find they can work surprisingly well.
We verify that this observation holds true across different VQA models and generalizes well to other tasks like image captioning.
We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training.
arXiv Detail & Related papers (2020-01-10T18:59:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.