ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning
- URL: http://arxiv.org/abs/2512.18571v1
- Date: Sun, 21 Dec 2025 02:45:08 GMT
- Title: ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning
- Authors: Weijie Zhou, Xuangtang Xiong, Ye Tian, Lijun Yue, Xinyu Wu, Wei Li, Chaoyang Zhao, Honghui Dong, Ming Tang, Jinqiao Wang, Zhengyou Zhang
- Abstract summary: ESearch-R1 is a cost-aware embodied reasoning framework. It unifies interactive dialogue (Ask), episodic memory retrieval (GetMemory), and physical navigation (Navigate) into a single decision process. It improves task success rates while reducing total operational costs by approximately 50%.
- Score: 40.2017873619555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have empowered embodied agents with remarkable capabilities in planning and reasoning. However, when facing ambiguous natural language instructions (e.g., "fetch the tool" in a cluttered room), current agents often fail to balance the high cost of physical exploration against the cognitive cost of human interaction. They typically treat disambiguation as a passive perception problem, lacking the strategic reasoning needed to minimize total task execution cost. To bridge this gap, we propose ESearch-R1, a cost-aware embodied reasoning framework that unifies interactive dialogue (Ask), episodic memory retrieval (GetMemory), and physical navigation (Navigate) into a single decision process. We introduce HC-GRPO (Heterogeneous Cost-Aware Group Relative Policy Optimization). Unlike traditional PPO, which relies on a separate value critic, HC-GRPO optimizes the MLLM by sampling groups of reasoning trajectories and reinforcing those that achieve the optimal trade-off between information gain and heterogeneous costs (e.g., navigation time and human attention). Extensive experiments in AI2-THOR demonstrate that ESearch-R1 significantly outperforms standard ReAct-based agents: it improves task success rates while reducing total operational costs by approximately 50%, validating the effectiveness of GRPO in aligning MLLM agents with physical-world constraints.
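The abstract frames HC-GRPO as sampling groups of reasoning trajectories and reinforcing those that best trade off information gain against heterogeneous action costs, with no learned value critic. Below is a minimal sketch of that group-relative, cost-penalized advantage computation; the cost weights, reward shape, and trajectory fields are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical per-action costs (names and values are illustrative only):
# Ask consumes human attention, Navigate consumes time, GetMemory is cheap.
COST_WEIGHTS = {"ask": 1.0, "navigate": 0.5, "get_memory": 0.2}

def trajectory_reward(traj, success_bonus=10.0):
    """Success reward minus the weighted sum of heterogeneous action costs."""
    cost = sum(COST_WEIGHTS.get(a, 0.0) for a in traj["actions"])
    return (success_bonus if traj["success"] else 0.0) - cost

def group_relative_advantages(trajectories, eps=1e-8):
    """GRPO-style advantages: normalize rewards within the sampled group,
    so no separate value critic is needed."""
    rewards = np.array([trajectory_reward(t) for t in trajectories])
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy group sampled for one ambiguous instruction ("fetch the tool").
group = [
    {"actions": ["ask", "navigate"], "success": True},
    {"actions": ["navigate", "navigate", "navigate"], "success": True},
    {"actions": ["get_memory", "navigate"], "success": True},
    {"actions": ["navigate"], "success": False},
]
print(group_relative_advantages(group))
# Each advantage would then weight that trajectory's token log-probabilities
# in a clipped policy-gradient update of the MLLM policy.
```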
Related papers
- Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents [20.119608534884858]
We propose a training-free paradigm to empower Vision-Language Models with autonomous search capabilities. By fusing a text-based search agent with a base VLM, we show that multi-modal search capabilities can be effectively composed without any additional multi-modal training data.
arXiv Detail & Related papers (2026-03-02T03:43:31Z) - Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue [28.25180116201176]
We propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. We first establish a User-centric Interaction Framework to provide a high-fidelity training gym. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy.
arXiv Detail & Related papers (2026-02-26T07:19:57Z) - SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent, which translates high-level research objectives into standardized experimental configurations, with an Experiment Manager that orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback. Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials.
arXiv Detail & Related papers (2025-11-29T09:18:39Z) - A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning [40.6234318894435]
Large language models split into two families: reasoning-centric LLMs and agentic LLMs. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries. We present the Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle.
arXiv Detail & Related papers (2025-10-13T17:08:25Z) - The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management [2.582081036460148]
Large Language Model (LLM)-based agents solve complex tasks through iterative reasoning, exploration, and tool use. We present a systematic comparison of these approaches within SWE-agent on SWE-bench Verified. We find that a simple environment observation masking strategy halves cost relative to the raw agent while matching, and sometimes slightly exceeding, the solve rate of LLM summarization.
arXiv Detail & Related papers (2025-08-29T09:02:35Z) - Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z) - Speculative Reward Model Boosts Decision Making Ability of LLMs Cost-Effectively [13.40488551654639]
We introduce the 3E Criteria to assess the cost-effectiveness of search strategies. We propose the Speculative Reward Model (SRM), a plug-and-play framework that integrates seamlessly with existing search strategies. Experimental results show that SRM reduces costs to 1/10 of the original search framework on average while maintaining effectiveness.
arXiv Detail & Related papers (2025-05-31T05:32:12Z) - The Real Barrier to LLM Agent Usability is Agentic ROI [110.31127571114635]
Large Language Model (LLM) agents represent a promising shift in human-AI interaction. We highlight a critical usability gap in high-demand, mass-market applications.
arXiv Detail & Related papers (2025-05-23T11:40:58Z) - Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments [54.67512489842682]
Large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. We take a first step toward exploring the early-exit behavior of LLM-based agents.
arXiv Detail & Related papers (2025-05-23T08:23:36Z) - Mastering the Task of Open Information Extraction with Large Language Models and Consistent Reasoning Environment [52.592199835286394]
Open Information Extraction (OIE) aims to extract objective structured knowledge from natural texts.
Large language models (LLMs) have exhibited remarkable in-context learning capabilities.
arXiv Detail & Related papers (2023-10-16T17:11:42Z)