Joint Answering and Explanation for Visual Commonsense Reasoning
- URL: http://arxiv.org/abs/2202.12626v1
- Date: Fri, 25 Feb 2022 11:26:52 GMT
- Title: Joint Answering and Explanation for Visual Commonsense Reasoning
- Authors: Zhenyang Li, Yangyang Guo, Kejie Wang, Yinwei Wei, Liqiang Nie, Mohan Kankanhalli
- Abstract summary: Visual Commonsense Reasoning pursues a higher-level form of visual comprehension.
It is composed of two indispensable processes: question answering over a given image and rationale inference for answer explanation.
We present a plug-and-play knowledge distillation enhanced framework to couple the question answering and rationale inference processes.
- Score: 46.44588492897933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Commonsense Reasoning (VCR), regarded as a challenging extension of
Visual Question Answering (VQA), pursues a higher-level form of visual
comprehension. It is composed of two indispensable processes: question
answering over a given image and rationale inference for answer explanation.
Over the years, a variety of methods tackling VCR have advanced performance
on the benchmark dataset. Significant as these methods are, they often
treat the two processes separately and hence decompose VCR into
two unrelated VQA instances. As a result, the pivotal connection between
question answering and rationale inference is broken, rendering existing
efforts less faithful in visual reasoning. To empirically study this issue, we
perform in-depth explorations of both language shortcuts and
generalization capability to verify the pitfalls of this treatment. Based on
our findings, in this paper, we present a plug-and-play knowledge distillation
enhanced framework to couple the question answering and rationale inference
processes. The key contribution is the introduction of a novel branch, which
serves as a bridge connecting the two processes. Given that our framework
is model-agnostic, we apply it to existing popular baselines and validate
its effectiveness on the benchmark dataset. As detailed in the experimental
results, when equipped with our framework, these baselines achieve consistent
and significant performance improvements, demonstrating the viability of
coupling the two processes as well as the superiority of the proposed framework.
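The abstract does not spell out the architecture, so the following PyTorch-style sketch only illustrates one plausible reading of the idea: alongside the usual Q→A and QA→R heads, a third "bridge" branch scores (answer, rationale) pairs jointly, and its marginal distributions are distilled into the two single-task heads. All module names, shapes, and the specific loss formulation below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of coupling Q->A and QA->R through a joint "bridge" branch
# with knowledge distillation. Hypothetical design, inferred from the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoupledVCRHead(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        # Q->A head: scores each answer candidate from fused question/image features.
        self.qa_head = nn.Linear(hidden_dim, 1)
        # QA->R head: scores each rationale candidate from question/answer/image features.
        self.qr_head = nn.Linear(hidden_dim, 1)
        # Bridge branch (assumed): jointly scores every (answer, rationale) pair.
        self.bridge_head = nn.Linear(hidden_dim, 1)

    def forward(self, qa_feats, qr_feats, joint_feats):
        # qa_feats:    [B, num_answers, D]
        # qr_feats:    [B, num_rationales, D]
        # joint_feats: [B, num_answers * num_rationales, D]
        qa_logits = self.qa_head(qa_feats).squeeze(-1)
        qr_logits = self.qr_head(qr_feats).squeeze(-1)
        joint_logits = self.bridge_head(joint_feats).squeeze(-1)
        return qa_logits, qr_logits, joint_logits


def coupling_distillation_loss(qa_logits, qr_logits, joint_logits,
                               num_answers, num_rationales, tau=2.0):
    """Distill the bridge branch into each single-task head (assumed formulation).

    The joint distribution over (answer, rationale) pairs is marginalized to obtain
    teacher distributions for answers and for rationales; each head is then pulled
    toward its marginal via KL divergence, coupling the two processes.
    """
    batch = joint_logits.size(0)
    joint_prob = F.softmax(joint_logits / tau, dim=-1)
    joint_prob = joint_prob.view(batch, num_answers, num_rationales)
    ans_teacher = joint_prob.sum(dim=2)   # marginal over rationales -> [B, num_answers]
    rat_teacher = joint_prob.sum(dim=1)   # marginal over answers    -> [B, num_rationales]
    kd_answer = F.kl_div(F.log_softmax(qa_logits / tau, dim=-1),
                         ans_teacher, reduction="batchmean")
    kd_rationale = F.kl_div(F.log_softmax(qr_logits / tau, dim=-1),
                            rat_teacher, reduction="batchmean")
    return kd_answer + kd_rationale
```

Because the distillation terms only add auxiliary losses and one extra scoring branch, such a design would be plug-and-play in the sense described: the base QA and rationale heads of an existing baseline stay unchanged.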
Related papers
- Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy [66.95501113584541]
Utility and topical relevance are critical measures in information retrieval.
We propose an Iterative utiliTy judgmEnt fraMework to promote each step of the cycle of Retrieval-Augmented Generation.
arXiv Detail & Related papers (2024-06-17T07:52:42Z)
- Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering [32.21000330743921]
We propose a novel framework that endows the model with capabilities of answering more general questions.
Specifically, a well-defined detector is adopted to predict image-question related relation phrases.
The optimal answer is predicted by choosing the supporting fact with the highest score.
arXiv Detail & Related papers (2023-12-20T02:35:18Z)
- Strong and Efficient Baselines for Open Domain Conversational Question Answering [2.773656427800412]
We study the State-of-the-Art (SotA) Dense Passage Retrieval (DPR) retriever and Fusion-in-Decoder (FiD) reader pipeline.
We propose and evaluate strong yet simple and efficient baselines, by introducing a fast reranking component between the retriever and the reader.
Experiments on two ODConvQA tasks, namely TopiOCQA and OR-QuAC, show that our method improves the SotA results while reducing the reader's latency by 60%.
arXiv Detail & Related papers (2023-10-23T08:48:14Z)
- Building Interpretable and Reliable Open Information Retriever for New Domains Overnight [67.03842581848299]
Information retrieval is a critical component for many downstream tasks such as open-domain question answering (QA).
We propose an information retrieval pipeline that uses an entity/event linking model and a query decomposition model to focus more accurately on different information units of the query.
We show that, while being more interpretable and reliable, our proposed pipeline significantly improves passage coverages and denotation accuracies across five IR and QA benchmarks.
arXiv Detail & Related papers (2023-08-09T07:47:17Z)
- LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)
- Visual Causal Scene Refinement for Video Question Answering [117.08431221482638]
We present a causal analysis of VideoQA and propose a framework for cross-modal causal reasoning, named Visual Causal Scene Refinement (VCSR).
Our VCSR involves two essential modules that refine consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention.
Experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering.
arXiv Detail & Related papers (2023-05-07T09:05:19Z)
- Learning to Agree on Vision Attention for Visual Commonsense Reasoning [50.904275811951614]
A VCR model aims at answering a question regarding an image, followed by the rationale prediction for the preceding answering process.
Existing methods ignore the pivotal relationship between the two processes, leading to sub-optimal model performance.
This paper presents a novel visual attention alignment method to efficaciously handle these two processes in a unified framework.
arXiv Detail & Related papers (2023-02-04T07:02:29Z)
- ReAct: Temporal Action Detection with Relational Queries [84.76646044604055]
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries.
We first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations.
Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries.
arXiv Detail & Related papers (2022-07-14T17:46:37Z)
- Coarse-to-Fine Reasoning for Visual Question Answering [18.535633096397397]
We present a new reasoning framework to fill the gap between visual features and semantic clues in the Visual Question Answering (VQA) task.
Our method first extracts the features and predicates from the image and question.
We then propose a new reasoning framework to jointly learn these features and predicates effectively in a coarse-to-fine manner.
arXiv Detail & Related papers (2021-10-06T06:29:52Z)