Answer-checking in Context: A Multi-modal Fully Attention Network for
Visual Question Answering
- URL: http://arxiv.org/abs/2010.08708v1
- Date: Sat, 17 Oct 2020 03:37:16 GMT
- Title: Answer-checking in Context: A Multi-modal Fully Attention Network for
Visual Question Answering
- Authors: Hantao Huang, Tao Han, Wei Han, Deep Yap, Cheng-Ming Chiang
- Abstract summary: We propose a fully attention-based Visual Question Answering architecture.
An answer-checking module is proposed to perform unified attention on the joint answer, question, and image representation.
Our model achieves state-of-the-art accuracy of 71.57% with fewer parameters on the VQA-v2.0 test-standard split.
- Score: 8.582218033859087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) is challenging due to the complex cross-modal
relations. It has received extensive attention from the research community.
From the human perspective, to answer a visual question, one needs to read the
question and then refer to the image to generate an answer. This answer will
then be checked against the question and image again for the final
confirmation. In this paper, we mimic this process and propose a fully
attention-based VQA architecture. Moreover, an answer-checking module is
proposed to perform unified attention on the joint answer, question, and image
representation to update the answer. This mimics the human answer-checking
process of considering the answer in context. With answer-checking modules and
transferred BERT layers, our model achieves state-of-the-art accuracy of
71.57% with fewer parameters on the VQA-v2.0 test-standard split.
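To make the answer-checking idea concrete, here is a minimal PyTorch sketch of unified attention over the joint answer, question, and image representation. It is an illustration under stated assumptions, not the authors' implementation: the module name, a single transformer layer, the feature dimensions, and the scoring head are all hypothetical.

```python
# Minimal sketch (not the paper's code) of answer-checking in context:
# concatenate answer, question, and image token embeddings, apply unified
# self-attention over the joint sequence, and rescore the candidate answer
# from its updated, context-aware representation.
import torch
import torch.nn as nn


class AnswerCheckingSketch(nn.Module):
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        # One unified attention layer over the joint [answer; question; image] sequence.
        self.unified_attn = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.score_head = nn.Linear(d_model, 1)  # rescores the checked answer

    def forward(self, answer_emb, question_emb, image_emb):
        # answer_emb:   (B, 1, D)  candidate answer representation
        # question_emb: (B, Lq, D) question token representations (e.g. from BERT)
        # image_emb:    (B, Lv, D) projected image region features
        joint = torch.cat([answer_emb, question_emb, image_emb], dim=1)
        checked = self.unified_attn(joint)  # unified attention over the joint sequence
        answer_ctx = checked[:, 0]          # updated answer representation in context
        return self.score_head(answer_ctx).squeeze(-1)


# Toy usage with random features.
B, Lq, Lv, D = 2, 14, 36, 768
model = AnswerCheckingSketch(d_model=D)
score = model(torch.randn(B, 1, D), torch.randn(B, Lq, D), torch.randn(B, Lv, D))
print(score.shape)  # torch.Size([2])
```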
Related papers
- Language Guided Visual Question Answering: Elevate Your Multimodal
Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z) - UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z) - Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it comprises graph construction, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z) - Answering Ambiguous Questions via Iterative Prompting [84.3426020642704]
In open-domain question answering, due to the ambiguity of questions, multiple plausible answers may exist.
One approach is to directly predict all valid answers, but this can struggle with balancing relevance and diversity.
We present AmbigPrompt to address the imperfections of existing approaches to answering ambiguous questions.
arXiv Detail & Related papers (2023-07-08T04:32:17Z) - Answer Mining from a Pool of Images: Towards Retrieval-Based Visual
Question Answering [7.3532068640624395]
We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as a context.
We propose a unified Multi Image BART (MI-BART) that takes a question and images retrieved by our relevance encoder for free-form, fluent answer generation.
Our proposed framework achieves an accuracy of 76.5% and a fluency of 79.3% on the proposed dataset, namely RETVQA.
arXiv Detail & Related papers (2023-06-29T06:22:43Z) - ChiQA: A Large Scale Image-based Real-World Question Answering Dataset
for Multi-Modal Understanding [42.5118058527339]
ChiQA contains more than 40K questions and more than 200K question-image pairs.
ChiQA requires a deep understanding of both language and vision, including grounding, comparisons, and reading.
We evaluate several state-of-the-art visual-language models such as ALBEF, demonstrating that there is still large room for improvement on ChiQA.
arXiv Detail & Related papers (2022-08-05T07:55:28Z) - Co-VQA : Answering by Interactive Sub Question Sequence [18.476819557695087]
This paper proposes a conversation-based VQA framework, which consists of three components: Questioner, Oracle, and Answerer.
To perform supervised learning for each model, we introduce a well-designed method to build a sub-question sequence (SQS) for each question on the VQA 2.0 and VQA-CP v2 datasets.
arXiv Detail & Related papers (2022-04-02T15:09:16Z) - Check It Again: Progressive Visual Question Answering via Visual
Entailment [12.065178204539693]
We propose a select-and-rerank (SAR) progressive framework based on Visual Entailment.
We first select the candidate answers relevant to the question or the image, then rerank the candidates via a visual entailment task (see the sketch after this list).
Experimental results show the effectiveness of our proposed framework, which establishes a new state-of-the-art accuracy on VQA-CP v2 with a 7.55% improvement.
arXiv Detail & Related papers (2021-06-08T18:00:38Z) - Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z) - Beyond VQA: Generating Multi-word Answer and Rationale to Visual
Questions [27.807568245576718]
We introduce ViQAR (Visual Question Answering and Reasoning), wherein a model must generate the complete answer and a rationale that seeks to justify the generated answer.
We show that our model generates strong answers and rationales through qualitative and quantitative evaluation, as well as through a human Turing Test.
arXiv Detail & Related papers (2020-10-24T09:44:50Z) - SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT)
We show that SQuINT improves model consistency by 5% and marginally improves performance on the Reasoning questions in VQA, while also producing better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)
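As referenced in the Check It Again (SAR) entry above, a select-and-rerank pipeline can be outlined as follows. This is a toy sketch under assumed interfaces, not the SAR authors' code: the entailment_score callable stands in for a visual entailment model that scores the question-plus-answer hypothesis against the image, and the weighting scheme is purely illustrative.

```python
# Minimal sketch of a select-and-rerank pipeline: keep the top candidate
# answers from a base VQA model, then rerank them with a visual-entailment-style
# score for each (question + candidate answer) hypothesis.
from typing import Callable, List, Tuple


def select_and_rerank(
    candidates: List[Tuple[str, float]],       # (answer, base VQA score) pairs
    entailment_score: Callable[[str], float],  # scores the answer hypothesis against the image
    top_k: int = 5,
    alpha: float = 0.5,                        # weight between VQA and entailment scores
) -> str:
    # Select: keep the k most promising answers from the base model.
    selected = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    # Rerank: combine the original score with the entailment score of the
    # hypothesis formed from the question and the candidate answer.
    reranked = sorted(
        selected,
        key=lambda c: alpha * c[1] + (1 - alpha) * entailment_score(c[0]),
        reverse=True,
    )
    return reranked[0][0]


# Toy usage with a dummy entailment scorer.
cands = [("red", 0.42), ("blue", 0.40), ("green", 0.10)]
print(select_and_rerank(cands, entailment_score=lambda ans: 0.9 if ans == "blue" else 0.2))
```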