Check It Again: Progressive Visual Question Answering via Visual
Entailment
- URL: http://arxiv.org/abs/2106.04605v1
- Date: Tue, 8 Jun 2021 18:00:38 GMT
- Title: Check It Again: Progressive Visual Question Answering via Visual
Entailment
- Authors: Qingyi Si, Zheng Lin, Mingyu Zheng, Peng Fu, Weiping Wang
- Abstract summary: We propose a select-and-rerank (SAR) progressive framework based on Visual Entailment.
We first select the candidate answers relevant to the question or the image, then we rerank the candidate answers by a visual entailment task.
Experimental results show the effectiveness of our proposed framework, which establishes a new state-of-the-art accuracy on VQA-CP v2 with a 7.55% improvement.
- Score: 12.065178204539693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While sophisticated Visual Question Answering models have achieved remarkable
success, they tend to answer questions only according to superficial
correlations between question and answer. Several recent approaches have been
developed to address this language priors problem. However, most of them
predict the answer from a single best output without verifying its
authenticity. Moreover, they only explore the interaction between
image and question, ignoring the semantics of the candidate answers. In this paper,
we propose a select-and-rerank (SAR) progressive framework based on Visual
Entailment. Specifically, we first select the candidate answers relevant to the
question or the image, then we rerank the candidate answers by a visual
entailment task, which verifies whether the image semantically entails the
synthetic statement of the question and each candidate answer. Experimental
results show the effectiveness of our proposed framework, which establishes a
new state-of-the-art accuracy on VQA-CP v2 with a 7.55% improvement.
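The two-step procedure above lends itself to a compact illustration. The following is a minimal sketch, not the authors' implementation: `vqa_candidate_scores`, `entailment_score`, and `question_to_statement` are hypothetical placeholders for a base VQA model, an image-text visual entailment scorer, and the question-plus-answer statement synthesis described in the abstract.

```python
# Hypothetical sketch of a select-and-rerank (SAR) style pipeline.
# `vqa_candidate_scores` stands in for any VQA model that scores answers,
# and `entailment_score` for any image-text visual entailment model.

from typing import Callable, Dict, List, Tuple


def question_to_statement(question: str, answer: str) -> str:
    # Naive question+answer -> declarative statement conversion,
    # used here purely for illustration.
    return f"{question.rstrip('?')} {answer}"


def select_and_rerank(
    image: object,
    question: str,
    vqa_candidate_scores: Callable[[object, str], Dict[str, float]],
    entailment_score: Callable[[object, str], float],
    top_n: int = 12,
) -> List[Tuple[str, float]]:
    # Step 1 (select): keep the top-N answers the base VQA model
    # considers relevant to the question and image.
    scores = vqa_candidate_scores(image, question)
    candidates = sorted(scores, key=scores.get, reverse=True)[:top_n]

    # Step 2 (rerank): check each candidate by asking whether the image
    # semantically entails the synthesized question+answer statement.
    reranked = [
        (ans, entailment_score(image, question_to_statement(question, ans)))
        for ans in candidates
    ]
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)
```

The final answer is simply the top element of the reranked list; the candidate set size (here `top_n=12`) is a tunable choice, not a value taken from the paper.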
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG), in the form of rationales, image captions, and scene graphs, to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- VQA Therapy: Exploring Answer Differences by Visually Grounding Answers [21.77545853313608]
We introduce the first dataset that visually grounds each unique answer to each visual question.
We then propose two novel problems, including predicting whether a visual question has a single answer grounding.
arXiv Detail & Related papers (2023-08-21T18:57:21Z)
- Answering Ambiguous Questions with a Database of Questions, Answers, and Revisions [95.92276099234344]
We present a new state-of-the-art for answering ambiguous questions that exploits a database of unambiguous questions generated from Wikipedia.
Our method improves performance by 15% on recall measures and 10% on measures which evaluate disambiguating questions from predicted outputs.
arXiv Detail & Related papers (2023-08-16T20:23:16Z)
- Answering Ambiguous Questions via Iterative Prompting [84.3426020642704]
In open-domain question answering, due to the ambiguity of questions, multiple plausible answers may exist.
One approach is to directly predict all valid answers, but this can struggle with balancing relevance and diversity.
We present AmbigPrompt to address the imperfections of existing approaches to answering ambiguous questions.
arXiv Detail & Related papers (2023-07-08T04:32:17Z)
- Weakly Supervised Visual Question Answer Generation [2.7605547688813172]
We present a weakly supervised method that synthetically generates question-answer pairs procedurally from visual information and captions.
We perform an exhaustive experimental analysis on the VQA dataset and find that our model significantly outperforms SOTA methods on BLEU scores.
arXiv Detail & Related papers (2023-06-11T08:46:42Z)
- Double Retrieval and Ranking for Accurate Question Answering [120.69820139008138]
We show that an answer verification step introduced in Transformer-based answer selection models can significantly improve the state of the art in Question Answering.
The results on three well-known datasets for AS2 show consistent and significant improvements over the state of the art.
arXiv Detail & Related papers (2022-01-16T06:20:07Z)
- Graph-Based Tri-Attention Network for Answer Ranking in CQA [56.42018099917321]
We propose a novel graph-based tri-attention network, namely GTAN, to generate answer ranking scores.
Experiments on three real-world CQA datasets demonstrate that GTAN significantly outperforms state-of-the-art answer ranking methods.
arXiv Detail & Related papers (2021-03-05T10:40:38Z)
- Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering [8.582218033859087]
We propose a fully attention-based Visual Question Answering architecture.
An answer-checking module is proposed to perform unified attention over the joint answer, question, and image representation.
Our model achieves state-of-the-art accuracy of 71.57% with fewer parameters on the VQA-v2.0 test-standard split.
arXiv Detail & Related papers (2020-10-17T03:37:16Z)
- Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue [42.563261906213455]
We propose an Answer-Driven Visual State Estimator (ADVSE) to impose the effects of different answers on visual states.
First, we propose an Answer-Driven Focusing Attention (ADFA) to capture the answer-driven effect on visual attention.
Then, based on the focusing attention, we obtain the visual state estimation through Conditional Visual Information Fusion (CVIF).
arXiv Detail & Related papers (2020-10-01T12:46:38Z)
- Rephrasing visual questions by specifying the entropy of the answer distribution [0.0]
We propose a novel task: rephrasing questions by controlling their ambiguity.
The ambiguity of a visual question is defined as the entropy of the answer distribution predicted by a VQA model.
We demonstrate that our approach can control the ambiguity of the rephrased questions, and observe that it is harder to increase ambiguity than to reduce it; a minimal numerical sketch of this entropy measure follows the list.
arXiv Detail & Related papers (2020-04-10T09:32:37Z)
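As a quick illustration of the entropy-based ambiguity measure in the last entry above, the snippet below computes the Shannon entropy of a toy answer distribution. The probabilities are invented for illustration, and the function is a generic entropy computation rather than that paper's code.

```python
import math
from typing import Sequence


def answer_entropy(probs: Sequence[float]) -> float:
    # Shannon entropy (in nats) of a VQA model's predicted answer
    # distribution; higher entropy means a more ambiguous question
    # under this definition.
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy example: a peaked distribution (unambiguous question) versus a
# flat one (ambiguous question). Probabilities are illustrative only.
print(answer_entropy([0.9, 0.05, 0.05]))         # ~0.39 nats
print(answer_entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.39 nats (= ln 4)
```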