Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering
- URL: http://arxiv.org/abs/2311.17331v2
- Date: Tue, 14 May 2024 06:19:08 GMT
- Title: Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering
- Authors: Zeqing Wang, Wentao Wan, Qiqing Lao, Runmeng Chen, Minjie Lang, Keze Wang, Liang Lin
- Abstract summary: Several methods have been proposed to augment large Vision Language Models (VLMs) for Visual Question Answering (VQA) by incorporating external knowledge.
We introduce a novel, explainable multi-agent collaboration framework designed to imitate human-like top-down reasoning.
- Score: 45.88079503965459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, several methods have been proposed to augment large Vision Language Models (VLMs) for Visual Question Answering (VQA) by incorporating external knowledge from knowledge bases or visual clues derived from question decomposition. Although these methods have achieved promising results, they still suffer from the challenge that VLMs cannot inherently understand the incorporated knowledge and might fail to generate optimal answers. In contrast, human cognition engages visual questions through a top-down reasoning process, systematically exploring relevant issues to derive a comprehensive answer. This not only facilitates an accurate answer but also provides a transparent rationale for the decision-making pathway. Motivated by this cognitive mechanism, we introduce a novel, explainable multi-agent collaboration framework designed to imitate human-like top-down reasoning by leveraging the expansive knowledge of Large Language Models (LLMs). Our framework comprises three agents, i.e., the Responder, the Seeker, and the Integrator, each contributing uniquely to the top-down reasoning process. The VLM-based Responder generates answer candidates for the question and responds to the issues raised by the other agents. The Seeker, primarily based on an LLM, identifies issues relevant to the question to inform the Responder and constructs a Multi-View Knowledge Base (MVKB) for the given visual scene by leveraging the understanding capabilities of the LLM. The Integrator combines information from the Seeker and the Responder to produce the final VQA answer. Through this collaboration mechanism, our framework explicitly constructs an MVKB for a specific visual scene and reasons toward answers in a top-down process. Extensive evaluations on diverse VQA datasets and VLMs demonstrate the superior applicability and interpretability of our framework over existing methods.
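As a rough illustration of the collaboration mechanism described above, the following minimal Python sketch mirrors the Responder / Seeker / Integrator loop. The agent names follow the abstract, but the interfaces, the MVKB representation, and the `top_down_vqa` orchestration function are hypothetical assumptions made for illustration; real VLM/LLM calls are stubbed out.

```python
# Hypothetical sketch of the Responder / Seeker / Integrator collaboration
# described in the abstract. Agent internals (VLM/LLM calls) are stubbed out;
# names and signatures are illustrative, not the authors' actual API.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MultiViewKnowledgeBase:
    """Issue -> response entries built for one visual scene."""
    entries: Dict[str, str] = field(default_factory=dict)


class Responder:
    """VLM-based agent: answers the main question and any issue it is given."""
    def answer(self, image, question: str) -> str:
        return f"<VLM answer to: {question}>"  # stub for a real VLM call


class Seeker:
    """LLM-based agent: proposes issues relevant to the question."""
    def propose_issues(self, question: str) -> List[str]:
        # A real implementation would prompt an LLM; placeholders here.
        return [f"What objects are relevant to '{question}'?",
                f"What scene context is needed to answer '{question}'?"]


class Integrator:
    """Combines the answer candidate with the MVKB into the final answer."""
    def integrate(self, candidate: str, mvkb: MultiViewKnowledgeBase) -> str:
        rationale = "; ".join(f"{q} -> {a}" for q, a in mvkb.entries.items())
        return f"{candidate} (rationale: {rationale})"  # stub for an LLM call


def top_down_vqa(image, question: str) -> str:
    responder, seeker, integrator = Responder(), Seeker(), Integrator()
    candidate = responder.answer(image, question)       # answer candidate
    mvkb = MultiViewKnowledgeBase()
    for issue in seeker.propose_issues(question):        # relevant issues
        mvkb.entries[issue] = responder.answer(image, issue)
    return integrator.integrate(candidate, mvkb)          # final VQA answer


print(top_down_vqa(image=None, question="What is the man holding?"))
```

The point of the sketch is the control flow: the Seeker's issues populate the MVKB, the Responder answers both the original question and each issue, and the Integrator produces the final answer together with the rationale that makes the decision pathway explainable.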
Related papers
- Rethinking Information Synthesis in Multimodal Question Answering: A Multi-Agent Perspective [42.832839189236694]
We propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
arXiv Detail & Related papers (2025-05-27T07:23:38Z) - Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization [80.09112808413133]
Mujica consists of a planner that decomposes questions into an acyclic graph of subquestions and a worker that resolves them via retrieval and reasoning. MyGO is a novel reinforcement learning method that replaces traditional policy gradient updates with Maximum Likelihood Estimation. Empirical results across multiple datasets demonstrate the effectiveness of Mujica-MyGO in enhancing multi-hop QA performance.
arXiv Detail & Related papers (2025-05-20T18:33:03Z) - UniRVQA: A Unified Framework for Retrieval-Augmented Vision Question Answering via Self-Reflective Joint Training [16.14877145354785]
We propose a Unified Retrieval-Augmented VQA framework (UniRVQA) for knowledge-intensive visual questions.
UniRVQA adapts general multimodal pre-trained models for fine-grained knowledge-intensive tasks within a unified framework.
Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy and an average 7.5% boost in base MLLMs' VQA performance.
arXiv Detail & Related papers (2025-04-05T05:42:12Z) - Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries [50.47265863322891]
Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos.
Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities.
We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-26T17:53:14Z) - AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z) - Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets [9.67464173044675]
Visual Question Answering (VQA) is the task of answering a question about an image.
We present an approach for declarative knowledge distillation from Large Language Models (LLMs).
Our results confirm that distilling knowledge from LLMs is in fact a promising direction besides data-driven rule learning approaches.
arXiv Detail & Related papers (2024-10-12T08:17:03Z) - AQA: Adaptive Question Answering in a Society of LLMs via Contextual Multi-Armed Bandit [59.10281630985958]
In question answering (QA), different questions can be effectively addressed with different answering strategies.
We develop a dynamic method that adaptively selects the most suitable QA strategy for each question.
Our experiments show that the proposed solution is viable for adaptive orchestration of a QA system with multiple modules.
arXiv Detail & Related papers (2024-09-20T12:28:18Z) - SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs [6.879945062426145]
We introduce SK-VQA, a large-scale synthetic multimodal dataset containing over 2 million visual question-answer pairs. Through human evaluations, we confirm the high quality of the generated question-answer pairs and their contextual relevance. Our results indicate that models trained on SK-VQA demonstrate enhanced generalization in both context-aware VQA and multimodal RAG settings.
arXiv Detail & Related papers (2024-06-28T01:14:43Z) - Multi-LLM QA with Embodied Exploration [55.581423861790945]
We investigate the use of Multi-Embodied LLM Explorers (MELE) for question-answering in an unknown environment.
Multiple LLM-based agents independently explore and then answer queries about a household environment.
We analyze different aggregation methods to generate a single, final answer for each query.
arXiv Detail & Related papers (2024-06-16T12:46:40Z) - LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge.
Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills.
We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment."
arXiv Detail & Related papers (2024-05-23T18:21:59Z) - Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation [34.45251681923171]
This paper presents a novel approach to developing large Vision-and-Language Models (VLMs).
We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process.
The dataset covers a range of tasks, from common ones like caption generation to specialized VQA tasks that require expert knowledge.
arXiv Detail & Related papers (2024-01-18T14:21:56Z) - Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions [15.262736501208467]
Large Language Models (LLMs) demonstrate impressive reasoning ability and the maintenance of world knowledge.
Since images are invisible to LLMs, researchers convert images to text to engage LLMs in the visual question reasoning procedure.
We design a framework that enables LLMs to proactively ask relevant questions to unveil more details in the image.
arXiv Detail & Related papers (2023-11-20T08:23:39Z) - Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts [22.669502403623166]
We present Reasoning Question Prompts for VQA tasks, which can further activate the potential of Large Language Models.
We generate self-contained questions as reasoning question prompts via an unsupervised question edition module.
Each reasoning question prompt clearly indicates the intent of the original question.
Then, the candidate answers and their associated confidence scores are fed into the LLMs.
arXiv Detail & Related papers (2023-11-15T15:40:46Z) - From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models [111.42052290293965]
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks.
End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive.
We propose Img2Prompt, a plug-and-play module that provides prompts that can bridge the aforementioned modality and task disconnections.
arXiv Detail & Related papers (2022-12-21T08:39:36Z)