Related papers: CogPlanner: Unveiling the Potential of Agentic Multimodal Retrieval Augmented Generation with Planning

CogPlanner: Unveiling the Potential of Agentic Multimodal Retrieval Augmented Generation with Planning

URL: http://arxiv.org/abs/2501.15470v2
Date: Fri, 31 Oct 2025 03:52:49 GMT
Title: CogPlanner: Unveiling the Potential of Agentic Multimodal Retrieval Augmented Generation with Planning
Authors: Xiaohan Yu, Zhihan Yang, Chong Chen,
Abstract summary: Multimodal Retrieval Augmented Generation (MRAG) systems have shown promise in enhancing the generation capabilities of multimodal large language models (MLLMs)<n>Existing MRAG frameworks primarily adhere to rigid, single-step retrieval strategies that fail to address real-world challenges of information acquisition and query reformulation.<n>We introduce the task of Multimodal Retrieval Augmented Generation Planning (MRAG Planning) that aims at effective information seeking and integration while minimizing computational overhead.
Score: 9.027579000292441
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Retrieval Augmented Generation (MRAG) systems have shown promise in enhancing the generation capabilities of multimodal large language models (MLLMs). However, existing MRAG frameworks primarily adhere to rigid, single-step retrieval strategies that fail to address real-world challenges of information acquisition and query reformulation. In this work, we introduce the task of Multimodal Retrieval Augmented Generation Planning (MRAG Planning) that aims at effective information seeking and integration while minimizing computational overhead. Specifically, we propose CogPlanner, an agentic plug-and-play framework inspired by human cognitive processes, which iteratively determines query reformulation and retrieval strategies to generate accurate and contextually relevant responses. CogPlanner supports parallel and sequential modeling paradigms. Furthermore, we introduce CogBench, a new benchmark designed to rigorously evaluate the MRAG Planning task and facilitate lightweight CogPlanner integration with resource-efficient MLLMs, such as Qwen2-VL-7B-Cog. Experimental results demonstrate that CogPlanner significantly outperforms existing MRAG baselines, offering improvements in both accuracy and efficiency with minimal additional computational costs.

Related papers

Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning [31.679428422518082]
We propose a Planner-centric Plan-Execute paradigm to resolve local optimization bottlenecks.<n>A novel Planner model performs global Directed Acyclic Graph (DAG) planning for complex queries.<n>We also introduce ComplexTool-Plan, a large-scale benchmark dataset featuring complex queries.<n>When integrated with a capable executor, our framework achieves state-of-the-art performance on the StableToolBench benchmark for complex user queries.
arXiv Detail & Related papers (2025-11-13T07:22:27Z)
ELHPlan: Efficient Long-Horizon Task Planning for Multi-Agent Collaboration [25.45699736192177]
Large Language Models (LLMs) enable intelligent multi-robot collaboration but face fundamental trade-offs.<n>We propose ELHPlan, a novel framework that introduces Action Chains--sequences of actions explicitly bound to sub-goal intentions.
arXiv Detail & Related papers (2025-09-29T03:15:56Z)
DecoupleSearch: Decouple Planning and Search via Hierarchical Reward Modeling [56.45844907505722]
We propose DecoupleSearch, a framework that decouples planning and search processes using dual value models.<n>Our approach constructs a reasoning tree, where each node represents planning and search steps.<n>During inference, Hierarchical Beam Search iteratively refines planning and search candidates with dual value models.
arXiv Detail & Related papers (2025-09-07T13:45:09Z)
PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving [66.42260489147617]
We introduce PLAN-TUNING, a framework that distills synthetic task decompositions from large-scale language models.<n>Plan-TUNING fine-tunes smaller models via supervised and reinforcement-learning objectives to improve complex reasoning.<n>Our analysis demonstrates how planning trajectories improves complex reasoning capabilities.
arXiv Detail & Related papers (2025-07-10T07:30:44Z)
CRISP: Complex Reasoning with Interpretable Step-based Plans [15.656686375199921]
We introduce CRISP (Complex Reasoning with Interpretable Step-based Plans), a dataset of high-level plans for mathematical reasoning and code generation.<n>We demonstrate that fine-tuning a small model on CRISP enables it to generate higher-quality plans than much larger models using few-shot prompting.
arXiv Detail & Related papers (2025-07-09T11:40:24Z)
Hierarchical Planning for Complex Tasks with Knowledge Graph-RAG and Symbolic Verification [5.727096041675994]
Large Language Models (LLMs) have shown promise as robotic planners but often struggle with long-horizon and complex tasks.<n>We propose a neuro-symbolic approach that enhances LLMs-based planners with Knowledge Graph-based RAG for hierarchical plan generation.
arXiv Detail & Related papers (2025-04-06T18:36:30Z)
PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving [89.60370366013142]
We propose PlanGEN, a model-agnostic and easily scalable agent framework with three key components: constraint, verification, and selection agents.<n>Specifically, our approach proposes constraint-guided iterative verification to enhance performance of inference-time algorithms.
arXiv Detail & Related papers (2025-02-22T06:21:56Z)
Interactive and Expressive Code-Augmented Planning with Large Language Models [62.799579304821826]
Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making. Recent techniques have sought to structure LLM outputs using control flow and other code-adjacent techniques to improve planning performance. We propose REPL-Plan, an LLM planning approach that is fully code-expressive and dynamic.
arXiv Detail & Related papers (2024-11-21T04:23:17Z)
Plan*RAG: Efficient Test-Time Planning for Retrieval Augmented Generation [20.5047654554575]
Plan*RAG is a framework that enables structured multi-hop reasoning in retrieval-augmented generation (RAG)<n>Plan*RAG consistently achieves improvements over recently proposed methods such as RQ-RAG and Self-RAG.
arXiv Detail & Related papers (2024-10-28T05:35:04Z)
Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning [94.76546523689113]
We introduce CodePlan, a framework that generates and follows textcode-form plans -- pseudocode that outlines high-level, structured reasoning processes. CodePlan effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks. It achieves a 25.1% relative improvement compared with directly generating responses.
arXiv Detail & Related papers (2024-09-19T04:13:58Z)
Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs [59.76268575344119]
We introduce a novel framework for enhancing large language models' (LLMs) planning capabilities by using planning data derived from knowledge graphs (KGs) LLMs fine-tuned with KG data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval.
arXiv Detail & Related papers (2024-06-20T13:07:38Z)
Unlocking Large Language Model's Planning Capabilities with Maximum Diversity Fine-tuning [10.704716790096498]
Large language models (LLMs) have demonstrated impressive task-solving capabilities through prompting techniques and system designs. For planning tasks with limited prior data, the performance of LLMs, including proprietary models like GPT and Gemini, is poor. This paper investigates the impact of fine-tuning on the planning capabilities of LLMs.
arXiv Detail & Related papers (2024-06-15T03:06:14Z)
Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans. We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems. Experiments on Over-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents.
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
Improving Planning with Large Language Models: A Modular Agentic Architecture [7.63815864256878]
Large language models (LLMs) often struggle with tasks that require multi-step reasoning or goal-directed planning. We propose an agentic architecture, the Modular Agentic Planner (MAP), in which planning is accomplished via the recurrent interaction of specialized modules. We find that MAP yields significant improvements over both standard LLM methods.
arXiv Detail & Related papers (2023-09-30T00:10:14Z)
Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called textttMEX. textttMEX integrates estimation and planning components while balancing exploration exploitation automatically. It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
AdaPlanner: Adaptive Planning from Feedback with Language Models [56.367020818139665]
Large language models (LLMs) have recently demonstrated the potential in acting as autonomous agents for sequential decision-making tasks. We propose a closed-loop approach, AdaPlanner, which allows the LLM agent to refine its self-generated plan adaptively in response to environmental feedback. To mitigate hallucination, we develop a code-style LLM prompt structure that facilitates plan generation across a variety of tasks, environments, and agent capabilities.
arXiv Detail & Related papers (2023-05-26T05:52:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.