Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions
- URL: http://arxiv.org/abs/2010.12852v2
- Date: Thu, 17 Jun 2021 09:44:12 GMT
- Title: Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions
- Authors: Radhika Dua, Sai Srinivas Kancheti and Vineeth N Balasubramanian
- Abstract summary: We introduce ViQAR (Visual Question Answering and Reasoning), wherein a model must generate the complete answer and a rationale that seeks to justify the generated answer.
We show that our model generates strong answers and rationales through qualitative and quantitative evaluation, as well as through a human Turing Test.
- Score: 27.807568245576718
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Question Answering is a multi-modal task that aims to measure
high-level visual understanding. Contemporary VQA models are restrictive in the
sense that answers are obtained via classification over a limited vocabulary
(in the case of open-ended VQA), or via classification over a set of
multiple-choice-type answers. In this work, we present a completely generative
formulation where a multi-word answer is generated for a visual query. To take
this a step forward, we introduce a new task: ViQAR (Visual Question Answering
and Reasoning), wherein a model must generate the complete answer and a
rationale that seeks to justify the generated answer. We propose an end-to-end
architecture to solve this task and describe how to evaluate it. We show that
our model generates strong answers and rationales through qualitative and
quantitative evaluation, as well as through a human Turing Test.
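To make the generative formulation concrete, here is a minimal PyTorch sketch (not the paper's architecture; all module names and sizes are illustrative) in which fused image and question features seed an answer decoder, whose final state then conditions a rationale decoder, so the rationale is generated after, and dependent on, the answer:

```python
# Minimal sketch (not the paper's exact architecture): fused image-question
# features initialize an answer decoder, and the answer decoder's final
# hidden state conditions a rationale decoder. All sizes are illustrative.
import torch
import torch.nn as nn

class AnswerRationaleGenerator(nn.Module):
    def __init__(self, vocab_size, feat_dim=512, hidden=512, emb=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.fuse = nn.Linear(2 * feat_dim, hidden)   # image ++ question -> h0
        self.answer_rnn = nn.GRU(emb, hidden, batch_first=True)
        self.rationale_rnn = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, img_feat, q_feat, ans_tokens, rat_tokens):
        # Fuse modalities into the answer decoder's initial hidden state.
        h0 = torch.tanh(self.fuse(torch.cat([img_feat, q_feat], -1))).unsqueeze(0)
        ans_out, h_ans = self.answer_rnn(self.embed(ans_tokens), h0)
        # Condition the rationale decoder on the answer decoder's final state.
        rat_out, _ = self.rationale_rnn(self.embed(rat_tokens), h_ans)
        return self.out(ans_out), self.out(rat_out)

model = AnswerRationaleGenerator(vocab_size=10000)
img, q = torch.randn(2, 512), torch.randn(2, 512)
ans = torch.randint(0, 10000, (2, 8))
rat = torch.randint(0, 10000, (2, 20))
ans_logits, rat_logits = model(img, q, ans, rat)
print(ans_logits.shape, rat_logits.shape)  # (2, 8, 10000) (2, 20, 10000)
```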
Related papers
- Trust but Verify: Programmatic VLM Evaluation in the Wild [62.14071929143684]
Programmatic VLM Evaluation (PROVE) is a new benchmarking paradigm for evaluating VLM responses to open-ended queries.
We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two.
arXiv Detail & Related papers (2024-10-17T01:19:18Z)
- Long-form Question Answering: An Iterative Planning-Retrieval-Generation Approach [28.849548176802262]
Long-form question answering (LFQA) poses a challenge as it involves generating detailed answers in the form of paragraphs.
We propose an LFQA model with iterative Planning, Retrieval, and Generation.
We find that our model outperforms the state-of-the-art models on various textual and factual metrics for the LFQA task.
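A schematic of such an iterative planning-retrieval-generation loop, with `plan`, `retrieve`, and `generate` stubbed out as placeholders (in the paper each stage is a learned model; only the loop structure is sketched here):

```python
# Schematic of an iterative plan-retrieve-generate loop for long-form QA.
# plan(), retrieve(), and generate() are placeholder stubs standing in for
# an LM planner, a retriever, and an LM generator.
def plan(question: str, draft: str) -> list[str]:
    """Break the question (and any current draft) into sub-queries."""
    return [question] if not draft else [f"{question} -- missing details"]

def retrieve(sub_queries: list[str], k: int = 3) -> list[str]:
    """Fetch top-k evidence passages per sub-query (stubbed)."""
    return [f"passage for: {q}" for q in sub_queries]

def generate(question: str, evidence: list[str]) -> str:
    """Draft a paragraph-length answer from the evidence (stubbed)."""
    return f"Answer to '{question}' grounded in {len(evidence)} passages."

def answer_long_form(question: str, rounds: int = 3) -> str:
    draft = ""
    for _ in range(rounds):
        sub_queries = plan(question, draft)   # 1. plan
        evidence = retrieve(sub_queries)      # 2. retrieve
        draft = generate(question, evidence)  # 3. generate / refine
    return draft

print(answer_long_form("Why do jet streams meander?"))
```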
arXiv Detail & Related papers (2023-11-15T21:22:27Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc to answer questions more accurately.
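One plausible reading of language guidance is simply prepending the auxiliary text to the query; the sketch below assembles such a knowledge-enriched prompt from a caption, rationale, and scene-graph triples (the function name and format are hypothetical, not the paper's):

```python
# Illustrative prompt assembly for language-guided VQA: caption, rationale,
# and scene-graph triples (all hypothetical inputs) are prepended as textual
# context so the model can reason about concepts not visible in the image.
def build_lg_prompt(question, caption, rationale, triples):
    scene = "; ".join(f"{s} -{r}-> {o}" for s, r, o in triples)
    return (
        f"Caption: {caption}\n"
        f"Scene graph: {scene}\n"
        f"Rationale: {rationale}\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_lg_prompt(
    question="What is the man about to do?",
    caption="A man holds a kite on a windy beach.",
    rationale="Kites are flown in open, windy areas.",
    triples=[("man", "holds", "kite"), ("kite", "on", "beach")],
)
print(prompt)  # feed to any VQA-capable language model
```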
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it comprises graph construction, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
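A toy illustration of the construct-prune-rank idea using networkx (the graph contents, confidence scores, and additive path score are invented for illustration and are not the paper's method):

```python
# Toy retrieve-then-rank over a knowledge graph: construct a graph, prune
# weak edges, enumerate paths from a question entity to a candidate answer,
# and rank paths by total edge confidence. Scores and entities are made up.
import networkx as nx

g = nx.Graph()
g.add_weighted_edges_from([
    ("zebra", "savanna", 0.9), ("zebra", "stripes", 0.8),
    ("savanna", "Africa", 0.7), ("stripes", "Africa", 0.2),
])

# Prune: drop edges below a confidence threshold.
g.remove_edges_from([(u, v) for u, v, w in g.edges(data="weight") if w < 0.5])

# Rank: score each remaining path by the sum of its edge confidences.
def path_score(path):
    return sum(g[u][v]["weight"] for u, v in zip(path, path[1:]))

paths = list(nx.all_simple_paths(g, "zebra", "Africa", cutoff=3))
best = max(paths, key=path_score)
print(best, path_score(best))  # the inference path explaining the answer
```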
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
- Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models [15.994664381976984]
We introduce a new benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the generalizability of VideoQA models.
In addition, we introduce a novel GNN-based soft verbalizer that enhances the prediction on rare and unseen answers.
Our ablation studies and qualitative analyses demonstrate that our GNN-based soft verbalizer further improves the model performance.
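A rough sketch of what a GNN-based soft verbalizer could look like: answer embeddings are smoothed over a word-similarity graph so rare or unseen answers borrow signal from related frequent ones (the single mean-aggregation step and all tensors below are simplifications, not the paper's model):

```python
# Rough sketch of a GNN-based soft verbalizer: answer-word embeddings are
# smoothed over a word-similarity graph (one round of mean aggregation),
# so rare/unseen answers borrow signal from related frequent answers.
import torch

num_answers, dim = 5, 16
emb = torch.randn(num_answers, dim)   # answer-word embeddings
adj = torch.eye(num_answers)          # self-loops
adj[0, 1] = adj[1, 0] = 1.0           # e.g. link "puppy" <-> "dog"

deg = adj.sum(1, keepdim=True)
smoothed = (adj @ emb) / deg          # one mean-aggregation step

video_q = torch.randn(dim)            # fused video+question embedding
scores = smoothed @ video_q           # per-answer logits
print(scores.softmax(0))              # rare answers share neighbors' mass
```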
arXiv Detail & Related papers (2023-08-18T07:45:10Z)
- An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
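The end-to-end formulation reduces QAG to a single sequence-to-sequence mapping from a context to a flat string of question-answer pairs; the sketch below shows only that target-format round-trip, with separators chosen arbitrarily rather than taken from the paper:

```python
# Sketch of the end-to-end QAG formulation: one seq2seq model maps a context
# to a flat string of question-answer pairs. Only the target-format encoding
# and parsing are shown; the separators are placeholders, not the paper's.
PAIR_SEP, QA_SEP = " | ", ", answer: "

def encode_pairs(pairs):
    return PAIR_SEP.join(f"question: {q}{QA_SEP}{a}" for q, a in pairs)

def decode_pairs(text):
    out = []
    for chunk in text.split(PAIR_SEP):
        q, _, a = chunk.removeprefix("question: ").partition(QA_SEP)
        out.append((q, a))
    return out

target = encode_pairs([("Who wrote Hamlet?", "Shakespeare"),
                       ("When?", "around 1600")])
print(target)
assert decode_pairs(target) == [("Who wrote Hamlet?", "Shakespeare"),
                                ("When?", "around 1600")]
```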
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
- Co-VQA: Answering by Interactive Sub Question Sequence [18.476819557695087]
This paper proposes a conversation-based VQA framework, which consists of three components: Questioner, Oracle, and Answerer.
To perform supervised learning for each component, we introduce a well-designed method to build a sub-question sequence (SQS) for each question in the VQA 2.0 and VQA-CP v2 datasets.
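A schematic of the three-component loop, with Questioner, Oracle, and Answerer stubbed as placeholder functions (in the paper each is a learned model):

```python
# Schematic of a Questioner/Oracle/Answerer loop: the Questioner decomposes
# the question into sub-questions, the Oracle answers each one against the
# image, and the Answerer aggregates the dialogue into a final answer.
# All three components are placeholder stubs.
def questioner(question, history):
    if len(history) >= 2:                 # stop after two sub-questions
        return None
    return f"sub-question {len(history) + 1} of '{question}'"

def oracle(image, sub_question):
    return f"fact for ({sub_question})"   # per-sub-question visual answer

def answerer(question, history):
    return f"final answer using {len(history)} sub-QA pairs"

def co_vqa(image, question):
    history = []
    while (sq := questioner(question, history)) is not None:
        history.append((sq, oracle(image, sq)))
    return answerer(question, history)

print(co_vqa(image="img.jpg", question="What sport is being played?"))
```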
arXiv Detail & Related papers (2022-04-02T15:09:16Z)
- AnswerSumm: A Manually-Curated Dataset and Pipeline for Answer Summarization [73.91543616777064]
Community Question Answering (CQA) forums such as Stack Overflow and Yahoo! Answers contain a rich resource of answers to a wide range of community-based questions.
One goal of answer summarization is to produce a summary that reflects the range of answer perspectives.
This work introduces a novel dataset of 4,631 CQA threads for answer summarization, curated by professional linguists.
arXiv Detail & Related papers (2021-11-11T21:48:02Z)
- Multi-Perspective Abstractive Answer Summarization [76.10437565615138]
Community Question Answering forums contain a rich resource of answers to a wide range of questions.
The goal of multi-perspective answer summarization is to produce a summary that includes all perspectives of the answer.
This work introduces a novel dataset creation method to automatically create multi-perspective, bullet-point abstractive summaries.
arXiv Detail & Related papers (2021-04-17T13:15:29Z)
- Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering [8.582218033859087]
We propose a fully attention-based Visual Question Answering architecture.
An answer-checking module is proposed to perform unified attention over the joint answer, question, and image representation.
Our model achieves state-of-the-art accuracy of 71.57% on the VQA-v2.0 test-standard split while using fewer parameters.
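A minimal sketch of answer checking as unified self-attention over the concatenated answer, question, and image token representations (dimensions, pooling, and the single attention layer are illustrative, not the paper's exact design):

```python
# Minimal sketch of "answer checking" as unified self-attention over the
# concatenated answer, question, and image token representations; a pooled
# output scores whether the candidate answer fits. Shapes are illustrative.
import torch
import torch.nn as nn

d, batch = 256, 2
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
score = nn.Linear(d, 1)

ans = torch.randn(batch, 3, d)    # candidate-answer tokens
ques = torch.randn(batch, 10, d)  # question tokens
img = torch.randn(batch, 36, d)   # image-region features

joint = torch.cat([ans, ques, img], dim=1)  # one joint sequence
checked, _ = attn(joint, joint, joint)      # unified attention
logit = score(checked.mean(dim=1))          # pooled answer-check score
print(logit.shape)                          # (2, 1)
```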
arXiv Detail & Related papers (2020-10-17T03:37:16Z)