Joint Answering and Explanation for Visual Commonsense Reasoning
- URL: http://arxiv.org/abs/2202.12626v1
- Date: Fri, 25 Feb 2022 11:26:52 GMT
- Title: Joint Answering and Explanation for Visual Commonsense Reasoning
- Authors: Zhenyang Li, Yangyang Guo, Kejie Wang, Yinwei Wei, Liqiang Nie, Mohan
Kankanhalli
- Abstract summary: Visual Commonsense Reasoning (VCR) pursues a higher-level visual comprehension.
It is composed of two indispensable processes: question answering over a given image and rationale inference for answer explanation.
We present a plug-and-play knowledge distillation enhanced framework to couple the question answering and rationale inference processes.
- Score: 46.44588492897933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Commonsense Reasoning (VCR), regarded as a challenging extension of
Visual Question Answering (VQA), pursues a higher-level
visual comprehension. It is composed of two indispensable processes: question
answering over a given image and rationale inference for answer explanation.
Over the years, a variety of methods tackling VCR have advanced the performance
on the benchmark dataset. Significant as these methods are, they often
treat the two processes separately and hence decompose VCR into
two unrelated VQA instances. As a result, the pivotal connection between
question answering and rationale inference is broken, rendering existing
efforts less faithful to visual reasoning. To empirically study this issue, we
perform some in-depth explorations in terms of both language shortcuts and
generalization capability to verify the pitfalls of this treatment. Based on
our findings, in this paper, we present a plug-and-play knowledge distillation
enhanced framework to couple the question answering and rationale inference
processes. The key contribution is the introduction of a novel branch, which
serves as a bridge connecting the two processes. Given that our framework
is model-agnostic, we apply it to the existing popular baselines and validate
its effectiveness on the benchmark dataset. As detailed in the experimental
results, when equipped with our framework, these baselines achieve consistent
and significant performance improvements, demonstrating the viability of
coupling the two processes, as well as the superiority of the proposed framework.
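The abstract names knowledge distillation as the mechanism that couples the two processes but does not state the objective. As a rough illustration only, the sketch below shows the generic temperature-scaled distillation loss (KL divergence between softened teacher and student distributions) over four candidate choices. It is not the authors' formulation: the function names, the temperature value, the example logits, and which branch acts as teacher or student are all assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic distillation loss: KL(teacher || student) on softened outputs.

    Hypothetical helper for illustration; the paper's actual objective
    is not given in the abstract.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


# Assumed example: logits over four candidate answers from two branches.
teacher_branch = [2.0, 0.5, -1.0, 0.1]
student_branch = [1.2, 0.9, -0.3, 0.0]
loss = distillation_loss(student_branch, teacher_branch)
```

In a setup like this, the loss shrinks toward zero as the student branch's distribution over the candidates approaches the teacher's, which is one plausible way for one process to inform the other.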
Related papers
- Disentangling Memory and Reasoning Ability in Large Language Models [97.26827060106581]
We propose a new inference paradigm that decomposes the complex inference process into two distinct and clear actions.
Our experiment results show that this decomposition improves model performance and enhances the interpretability of the inference process.
(2024-11-20T17:55:38Z)
- Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering [32.21000330743921]
We propose a novel framework that endows the model with capabilities of answering more general questions.
Specifically, a well-defined detector is adopted to predict image-question related relation phrases.
The optimal answer is predicted by choosing the supporting fact with the highest score.
(2023-12-20T02:35:18Z)
- Strong and Efficient Baselines for Open Domain Conversational Question Answering [2.773656427800412]
We study the State-of-the-Art (SotA) Dense Passage Retrieval (DPR) retriever and Fusion-in-Decoder (FiD) reader pipeline.
We propose and evaluate strong yet simple and efficient baselines, by introducing a fast reranking component between the retriever and the reader.
Experiments on two ODConvQA tasks, namely TopiOCQA and OR-QuAC, show that our method improves the SotA results, while reducing the reader's latency by 60%.
(2023-10-23T08:48:14Z)
- Building Interpretable and Reliable Open Information Retriever for New Domains Overnight [67.03842581848299]
Information retrieval is a critical component for many down-stream tasks such as open-domain question answering (QA).
We propose an information retrieval pipeline that uses an entity/event linking model and a query decomposition model to focus more accurately on different information units of the query.
We show that, while being more interpretable and reliable, our proposed pipeline significantly improves passage coverages and denotation accuracies across five IR and QA benchmarks.
(2023-08-09T07:47:17Z)
- LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
(2023-07-26T12:13:00Z)
- Visual Causal Scene Refinement for Video Question Answering [117.08431221482638]
We present a causal analysis of VideoQA and propose a framework for cross-modal causal reasoning, named Visual Causal Scene Refinement (VCSR).
Our VCSR involves two essential modules, which refine consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention.
Experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering.
(2023-05-07T09:05:19Z)
- Learning to Agree on Vision Attention for Visual Commonsense Reasoning [50.904275811951614]
A VCR model aims at answering a question regarding an image, followed by the rationale prediction for the preceding answering process.
Existing methods ignore the pivotal relationship between the two processes, leading to sub-optimal model performance.
This paper presents a novel visual attention alignment method to efficaciously handle these two processes in a unified framework.
(2023-02-04T07:02:29Z)
- ReAct: Temporal Action Detection with Relational Queries [84.76646044604055]
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries.
We first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations.
Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries.
(2022-07-14T17:46:37Z)
- Coarse-to-Fine Reasoning for Visual Question Answering [18.535633096397397]
We present a new reasoning framework to fill the gap between visual features and semantic clues in the Visual Question Answering (VQA) task.
Our method first extracts the features and predicates from the image and question.
We then propose a new reasoning framework to effectively jointly learn these features and predicates in a coarse-to-fine manner.
(2021-10-06T06:29:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.