Efficient Multimodal Planning Agent for Visual Question-Answering
- URL: http://arxiv.org/abs/2601.20676v1
- Date: Wed, 28 Jan 2026 14:58:59 GMT
- Title: Efficient Multimodal Planning Agent for Visual Question-Answering
- Authors: Zhuo Chen, Xinyu Geng, Xinyu Wang, Yong Jiang, Zhen Zhang, Pengjun Xie, Kewei Tu,
- Abstract summary: This paper proposes a method that trains a multimodal planning agent, dynamically decomposing the mRAG pipeline to solve the VQA task.<n>In our experiments, the agent can help reduce redundant computations, cutting search time by over 60% compared to existing methods.
- Score: 67.26245301307539
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Question-Answering (VQA) is a challenging multimodal task that requires integrating visual and textual information to generate accurate responses. While multimodal Retrieval-Augmented Generation (mRAG) has shown promise in enhancing VQA systems by providing more evidence on both image and text sides, the default procedure that addresses VQA queries, especially the knowledge-intensive ones, often relies on multi-stage pipelines of mRAG with inherent dependencies. To mitigate the inefficiency limitations while maintaining VQA task performance, this paper proposes a method that trains a multimodal planning agent, dynamically decomposing the mRAG pipeline to solve the VQA task. Our method optimizes the trade-off between efficiency and effectiveness by training the agent to intelligently determine the necessity of each mRAG step. In our experiments, the agent can help reduce redundant computations, cutting search time by over 60\% compared to existing methods and decreasing costly tool calls. Meanwhile, experiments demonstrate that our method outperforms all baselines, including a Deep Research agent and a carefully designed prompt-based method, on average over six various datasets. Code will be released.
Related papers
- Empowering RepoQA-Agent based on Reinforcement Learning Driven by Monte-carlo Tree Search [70.63903518295785]
We introduce RepoSearch-R1, a novel agentic reinforcement learning framework driven by Monte-carlo Tree Search.<n>Based on RepoSearch-R1, we construct a RepoQA-Agent specifically designed for repository question-answering tasks.
arXiv Detail & Related papers (2025-10-30T09:10:36Z) - Efficient Agent: Optimizing Planning Capability for Multimodal Retrieval Augmented Generation [17.115587821286223]
Multimodal Retrieval-Augmented Generation (mRAG) has emerged as a promising solution to address the temporal limitations of Multimodal Large Language Models (MLLMs) in real-world scenarios.<n>We propose E-Agent, an agent framework featuring two key innovations: a mRAG planner trained to dynamically orchestrate multimodal tools based on contextual reasoning and a task executor employing tool-aware execution sequencing.
arXiv Detail & Related papers (2025-08-12T10:17:12Z) - QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering [27.567923098020586]
We propose QA-Dragon, a Query-Aware Dynamic RAG System for Knowledge-Intensive VQA.<n>By orchestrating both text and image search agents in a hybrid setup, our system supports multimodal, multi-turn, and multi-hop reasoning.<n>We evaluate our framework on the Meta CRAG-MM Challenge at KDD Cup 2025.
arXiv Detail & Related papers (2025-08-07T09:32:49Z) - MAO-ARAG: Multi-Agent Orchestration for Adaptive Retrieval-Augmented Generation [35.853052535353775]
In question-answering (QA) systems, Retrieval-Augmented Generation (RAG) has become pivotal in enhancing response accuracy and reducing hallucination issues.<n>We propose an adaptive RAG framework called MAO-ARAG, which leverages multi-agent orchestration.
arXiv Detail & Related papers (2025-08-01T18:15:22Z) - Knowledge-Aware Iterative Retrieval for Multi-Agent Systems [0.0]
We introduce a novel large language model (LLM)-driven agent framework.<n>It iteratively refines queries and filters contextual evidence by leveraging dynamically evolving knowledge.<n>The proposed system supports both competitive and collaborative sharing of updated context.
arXiv Detail & Related papers (2025-03-17T15:27:02Z) - Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [92.5712549836791]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs)<n>We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z) - Agent-Oriented Planning in Multi-Agent Systems [54.429028104022066]
We propose AOP, a novel framework for agent-oriented planning in multi-agent systems.<n>In this study, we identify three critical design principles of agent-oriented planning, including solvability, completeness, and non-redundancy.<n> Extensive experiments demonstrate the advancement of AOP in solving real-world problems compared to both single-agent systems and existing planning strategies for multi-agent systems.
arXiv Detail & Related papers (2024-10-03T04:07:51Z) - AQA: Adaptive Question Answering in a Society of LLMs via Contextual Multi-Armed Bandit [59.10281630985958]
In question answering (QA), different questions can be effectively addressed with different answering strategies.
We develop a dynamic method that adaptively selects the most suitable QA strategy for each question.
Our experiments show that the proposed solution is viable for adaptive orchestration of a QA system with multiple modules.
arXiv Detail & Related papers (2024-09-20T12:28:18Z) - Towards Human-Level Understanding of Complex Process Engineering Schematics: A Pedagogical, Introspective Multi-Agent Framework for Open-Domain Question Answering [0.0]
In the chemical and process industries, Process Flow Diagrams (PFDs) and Piping and Instrumentation Diagrams (P&IDs) are critical for design, construction, and maintenance.
Recent advancements in Generative AI have shown promise in understanding and interpreting process diagrams for Visual Question Answering (VQA)
We propose a secure, on-premises enterprise solution using a hierarchical, multi-agent Retrieval Augmented Generation (RAG) framework.
arXiv Detail & Related papers (2024-08-24T19:34:04Z) - Dynamic Multi-Robot Task Allocation under Uncertainty and Temporal
Constraints [52.58352707495122]
We present a multi-robot allocation algorithm that decouples the key computational challenges of sequential decision-making under uncertainty and multi-agent coordination.
We validate our results over a wide range of simulations on two distinct domains: multi-arm conveyor belt pick-and-place and multi-drone delivery dispatch in a city.
arXiv Detail & Related papers (2020-05-27T01:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.