Efficient Agent: Optimizing Planning Capability for Multimodal Retrieval Augmented Generation
- URL: http://arxiv.org/abs/2508.08816v1
- Date: Tue, 12 Aug 2025 10:17:12 GMT
- Title: Efficient Agent: Optimizing Planning Capability for Multimodal Retrieval Augmented Generation
- Authors: Yuechen Wang, Yuming Qiao, Dan Meng, Jun Yang, Haonan Lu, Zhenyu Yang, Xudong Zhang
- Abstract summary: Multimodal Retrieval-Augmented Generation (mRAG) has emerged as a promising solution to address the temporal limitations of Multimodal Large Language Models (MLLMs) in real-world scenarios. We propose E-Agent, an agent framework featuring two key innovations: an mRAG planner trained to dynamically orchestrate multimodal tools based on contextual reasoning, and a task executor employing tool-aware execution sequencing.
- Score: 17.115587821286223
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal Retrieval-Augmented Generation (mRAG) has emerged as a promising solution to address the temporal limitations of Multimodal Large Language Models (MLLMs) in real-world scenarios like news analysis and trending topics. However, existing approaches often suffer from rigid retrieval strategies and under-utilization of visual information. To bridge this gap, we propose E-Agent, an agent framework featuring two key innovations: an mRAG planner trained to dynamically orchestrate multimodal tools based on contextual reasoning, and a task executor employing tool-aware execution sequencing to implement optimized mRAG workflows. E-Agent adopts a one-time mRAG planning strategy that enables efficient information retrieval while minimizing redundant tool invocations. To rigorously assess the planning capabilities of mRAG systems, we introduce the Real-World mRAG Planning (RemPlan) benchmark. This novel benchmark contains both retrieval-dependent and retrieval-independent question types, systematically annotated with the essential retrieval tools required for each instance. The benchmark's explicit mRAG planning annotations and diverse question design enhance its practical relevance by simulating real-world scenarios requiring dynamic mRAG decisions. Experiments across RemPlan and three established benchmarks demonstrate E-Agent's superiority: a 13% accuracy gain over state-of-the-art mRAG methods while reducing redundant searches by 37%.
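The one-time planning idea in the abstract (decide the full tool sequence once, then execute it without re-planning) can be sketched in a few lines. Everything below — the tool names, the keyword heuristic standing in for the trained planner, and the `run_mrag` wrapper — is an illustrative assumption, not E-Agent's actual implementation.

```python
from typing import Callable

# Hypothetical multimodal retrieval tools the planner can orchestrate.
TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": lambda q: f"web results for '{q}'",
    "ocr": lambda q: f"text extracted from the image for '{q}'",
    "image_search": lambda q: f"images related to '{q}'",
}

def plan_tools(query: str, has_image: bool = False) -> list[str]:
    """One-time planner: predict the entire tool sequence up front.

    A trained planner would derive this from contextual reasoning; a toy
    keyword rule stands in here. Returning an empty plan models a
    retrieval-independent question, which is what avoids redundant searches.
    """
    plan: list[str] = []
    if any(k in query.lower() for k in ("latest", "news", "today")):
        plan.append("web_search")
    if has_image:
        plan.extend(["ocr", "image_search"])
    return plan

def run_mrag(query: str, has_image: bool = False) -> str:
    """Executor: run the planned tools in order, with no re-planning loop."""
    plan = plan_tools(query, has_image)
    evidence = [TOOLS[t](query) for t in plan]
    context = " | ".join(evidence) if evidence else "(no retrieval needed)"
    # Stand-in for the MLLM answering over the retrieved context.
    return f"answer({query!r}, context={context!r})"
```

The contrast with iterative ReAct-style agents is that `plan_tools` is called exactly once per query, so the number of tool invocations is fixed before execution begins.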
Related papers
- PerfGuard: A Performance-Aware Agent for Visual Content Generation [53.591105729011595]
PerfGuard is a performance-aware agent framework for visual content generation. It integrates tool performance boundaries into task planning and scheduling. It has advantages in tool selection accuracy, execution reliability, and alignment with user intent.
arXiv Detail & Related papers (2026-01-30T05:12:19Z)
- Efficient Multimodal Planning Agent for Visual Question-Answering [67.26245301307539]
This paper proposes a method that trains a multimodal planning agent, dynamically decomposing the mRAG pipeline to solve the VQA task. In our experiments, the agent can help reduce redundant computations, cutting search time by over 60% compared to existing methods.
arXiv Detail & Related papers (2026-01-28T14:58:59Z)
- Reason-Plan-ReAct: A Reasoner-Planner Supervising a ReAct Executor for Complex Enterprise Tasks [0.0]
We introduce RP-ReAct, a novel multi-agent approach that decouples strategic planning from low-level execution to achieve superior reliability and efficiency. RP-ReAct consists of a Reasoner Planner Agent (RPA), responsible for planning each sub-step, and one or more Proxy-Execution Agents (PEAs) that translate sub-steps into concrete tool interactions. We evaluate RP-ReAct on the challenging, multi-domain ToolQA benchmark using a diverse set of six open-weight reasoning models.
arXiv Detail & Related papers (2025-12-03T08:28:40Z)
- Multi-Agent Tool-Integrated Policy Optimization [67.12841355267678]
Large language models (LLMs) increasingly rely on multi-turn tool-integrated planning for knowledge-intensive and complex reasoning tasks. Existing implementations typically rely on a single agent, but they suffer from limited context length and noisy tool responses. No existing methods support effective reinforcement learning post-training of tool-integrated multi-agent frameworks.
arXiv Detail & Related papers (2025-10-06T10:44:04Z)
- DecoupleSearch: Decouple Planning and Search via Hierarchical Reward Modeling [56.45844907505722]
We propose DecoupleSearch, a framework that decouples planning and search processes using dual value models. Our approach constructs a reasoning tree, where each node represents planning and search steps. During inference, Hierarchical Beam Search iteratively refines planning and search candidates with dual value models.
arXiv Detail & Related papers (2025-09-07T13:45:09Z)
- RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory [57.449129198822476]
RCR-Router is a role-aware context routing framework for multi-agent large language model (LLM) systems. It dynamically selects semantically relevant memory subsets for each agent based on its role and task stage. A lightweight scoring policy guides memory selection, and agent outputs are integrated into a shared memory store.
arXiv Detail & Related papers (2025-08-06T21:59:34Z)
- HiRA: A Hierarchical Reasoning Framework for Decoupled Planning and Execution in Deep Search [85.12447821237045]
HiRA is a hierarchical framework that separates strategic planning from specialized execution. Our approach decomposes complex search tasks into focused subtasks and assigns each subtask to domain-specific agents equipped with external tools and reasoning capabilities. Experiments on four complex, cross-modal deep search benchmarks demonstrate that HiRA significantly outperforms state-of-the-art RAG and agent-based systems.
arXiv Detail & Related papers (2025-07-03T14:18:08Z)
- LOP: Learning Optimal Pruning for Efficient On-Demand MLLMs Scaling [52.1366057696919]
LOP is an efficient neural pruning framework that learns optimal pruning strategies from the target pruning constraint. The LOP approach trains autoregressive neural networks (NNs) to directly predict layer-wise pruning strategies adaptive to the target pruning constraint. Experimental results show that LOP outperforms state-of-the-art pruning methods on various metrics while achieving up to three orders of magnitude speedup.
arXiv Detail & Related papers (2025-06-15T12:14:16Z)
- ImpRAG: Retrieval-Augmented Generation with Implicit Queries [49.510101132093396]
ImpRAG is a query-free RAG system that integrates retrieval and generation into a unified model. We show that ImpRAG achieves 3.6-11.5-point improvements in exact match scores on unseen tasks with diverse formats.
arXiv Detail & Related papers (2025-06-02T21:38:21Z)
- MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning [43.66966457772646]
MA-RAG orchestrates a collaborative set of specialized AI agents to tackle each stage of the RAG pipeline with task-aware reasoning. Our design allows fine-grained control over information flow without any model fine-tuning. This modular and reasoning-driven architecture enables MA-RAG to deliver robust, interpretable results.
arXiv Detail & Related papers (2025-05-26T15:05:18Z)
- InstructRAG: Leveraging Retrieval-Augmented Generation on Instruction Graphs for LLM-Based Task Planning [6.75641900721385]
Large language models (LLMs) have enabled their use as agents for planning complex tasks. Retrieval-augmented generation (RAG) offers new opportunities by leveraging external databases to ground generation in retrieved information. We propose InstructRAG, a novel solution within a multi-agent meta-reinforcement learning framework to address these challenges.
arXiv Detail & Related papers (2025-04-17T15:41:39Z)
- Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$^2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT).
arXiv Detail & Related papers (2025-02-24T16:25:25Z)
- REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark [16.55516587540082]
We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval. We propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and robustness to query rephrasing.
arXiv Detail & Related papers (2025-02-17T22:10:47Z)
- Unveiling the Potential of Multimodal Retrieval Augmented Generation with Planning [5.205803766626321]
Multimodal Retrieval Augmented Generation (MRAG) systems often rely on rigid, single-step retrieval methods. We present CogPlanner, a versatile framework inspired by human cognitive processes. CogPlanner iteratively refines queries and selects retrieval strategies, enabling both parallel and sequential modeling approaches.
arXiv Detail & Related papers (2025-01-26T10:16:42Z)
- Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs). We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs. We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z)
- Plan*RAG: Efficient Test-Time Planning for Retrieval Augmented Generation [20.5047654554575]
Plan*RAG is a framework that enables structured multi-hop reasoning in retrieval-augmented generation (RAG). Plan*RAG consistently achieves improvements over recently proposed methods such as RQ-RAG and Self-RAG.
arXiv Detail & Related papers (2024-10-28T05:35:04Z)
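Several entries above (DecoupleSearch, AR-MCTS) share a common pattern: generate candidate planning steps, score them with a value model, and keep only the best. A minimal beam-search sketch of that pattern follows; the candidate generator and value function here are toy stand-ins, not any paper's actual API.

```python
def expand(state: str) -> list[str]:
    """Hypothetical generator: propose next planning steps for a state."""
    return [state + c for c in "abc"]

def value(state: str) -> float:
    """Hypothetical value model: score a partial plan (higher is better)."""
    return state.count("a") - 0.5 * state.count("c")

def beam_search(init: str, width: int = 2, depth: int = 3) -> str:
    """Iteratively expand the beam and keep the top-`width` candidates."""
    beam = [init]
    for _ in range(depth):
        candidates = [s for state in beam for s in expand(state)]
        # Prune to the best `width` candidates under the value model.
        beam = sorted(candidates, key=value, reverse=True)[:width]
    return beam[0]
```

With this toy scoring, widening the beam or deepening the search trades compute for plan quality; the frameworks above tune that trade-off with learned value models instead of hand-written heuristics.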
This list is automatically generated from the titles and abstracts of the papers on this site.