ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning
- URL: http://arxiv.org/abs/2512.18571v1
- Date: Sun, 21 Dec 2025 02:45:08 GMT
- Title: ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning
- Authors: Weijie Zhou, Xuangtang Xiong, Ye Tian, Lijun Yue, Xinyu Wu, Wei Li, Chaoyang Zhao, Honghui Dong, Ming Tang, Jinqiao Wang, Zhengyou Zhang
- Abstract summary: ESearch-R1 is a cost-aware embodied reasoning framework. It unifies interactive dialogue (Ask), episodic memory retrieval (GetMemory), and physical navigation (Navigate) into a single decision process. It improves task success rates while reducing total operational costs by approximately 50%.
- Score: 40.2017873619555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have empowered embodied agents with remarkable capabilities in planning and reasoning. However, when facing ambiguous natural language instructions (e.g., "fetch the tool" in a cluttered room), current agents often fail to balance the high cost of physical exploration against the cognitive cost of human interaction. They typically treat disambiguation as a passive perception problem, lacking the strategic reasoning needed to minimize total task execution cost. To bridge this gap, we propose ESearch-R1, a cost-aware embodied reasoning framework that unifies interactive dialogue (Ask), episodic memory retrieval (GetMemory), and physical navigation (Navigate) into a single decision process. We introduce HC-GRPO (Heterogeneous Cost-Aware Group Relative Policy Optimization). Unlike traditional PPO, which relies on a separate value critic, HC-GRPO optimizes the MLLM by sampling groups of reasoning trajectories and reinforcing those that achieve the optimal trade-off between information gain and heterogeneous costs (e.g., navigation time and human attention). Extensive experiments in AI2-THOR demonstrate that ESearch-R1 significantly outperforms standard ReAct-based agents: it improves task success rates while reducing total operational costs by approximately 50%, validating the effectiveness of GRPO in aligning MLLM agents with physical-world constraints.
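The abstract frames HC-GRPO as sampling groups of reasoning trajectories and reinforcing those that best trade off information gain against heterogeneous action costs, with no learned value critic. Below is a minimal sketch of that group-relative, cost-penalized advantage computation; the cost weights, reward shape, and trajectory fields are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical per-action costs (names and values are illustrative only):
# Ask consumes human attention, Navigate consumes time, GetMemory is cheap.
COST_WEIGHTS = {"ask": 1.0, "navigate": 0.5, "get_memory": 0.2}

def trajectory_reward(traj, success_bonus=10.0):
    """Success reward minus the weighted sum of heterogeneous action costs."""
    cost = sum(COST_WEIGHTS.get(a, 0.0) for a in traj["actions"])
    return (success_bonus if traj["success"] else 0.0) - cost

def group_relative_advantages(trajectories, eps=1e-8):
    """GRPO-style advantages: normalize rewards within the sampled group,
    so no separate value critic is needed."""
    rewards = np.array([trajectory_reward(t) for t in trajectories])
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy group sampled for one ambiguous instruction ("fetch the tool").
group = [
    {"actions": ["ask", "navigate"], "success": True},
    {"actions": ["navigate", "navigate", "navigate"], "success": True},
    {"actions": ["get_memory", "navigate"], "success": True},
    {"actions": ["navigate"], "success": False},
]
print(group_relative_advantages(group))
# Each advantage would then weight that trajectory's token log-probabilities
# in a clipped policy-gradient update of the MLLM policy.
```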
Related papers
- Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents [20.119608534884858]
We propose a training-free paradigm to empower Vision-Language Models with autonomous search capabilities. By fusing a text-based search agent with a base VLM, we show that multi-modal search capabilities can be effectively composed without any additional multi-modal training data.
arXiv Detail & Related papers (2026-03-02T03:43:31Z) - Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue [28.25180116201176]
We propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. We first establish a User-centric Interaction Framework to provide a high-fidelity training gym. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy.
arXiv Detail & Related papers (2026-02-26T07:19:57Z) - SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent, which translates high-level research objectives into standardized experimental configurations, with an Experiment Manager that orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback. Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials.
arXiv Detail & Related papers (2025-11-29T09:18:39Z) - A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning [40.6234318894435]
Large language models split into two families: reasoning-centric LLMs and agentic LLMs. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries. We present the Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle.
arXiv Detail & Related papers (2025-10-13T17:08:25Z) - The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management [2.582081036460148]
Large Language Model (LLM)-based agents solve complex tasks through iterative reasoning, exploration, and tool use. We present a systematic comparison of these approaches within SWE-agent on SWE-bench Verified. We find that a simple environment observation masking strategy halves cost relative to the raw agent while matching, and sometimes slightly exceeding, the solve rate of LLM summarization.
arXiv Detail & Related papers (2025-08-29T09:02:35Z) - Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z) - Speculative Reward Model Boosts Decision Making Ability of LLMs Cost-Effectively [13.40488551654639]
We introduce the 3E Criteria to assess the cost-effectiveness of search strategies. We propose the Speculative Reward Model (SRM), a plug-and-play framework that integrates seamlessly with existing search strategies. Experimental results show that SRM reduces costs to 1/10 of the original search framework on average while maintaining effectiveness.
arXiv Detail & Related papers (2025-05-31T05:32:12Z) - The Real Barrier to LLM Agent Usability is Agentic ROI [110.31127571114635]
Large Language Model (LLM) agents represent a promising shift in human-AI interaction. We highlight a critical usability gap in high-demand, mass-market applications.
arXiv Detail & Related papers (2025-05-23T11:40:58Z) - Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments [54.67512489842682]
Large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. We take a first step toward exploring the early-exit behavior of LLM-based agents.
arXiv Detail & Related papers (2025-05-23T08:23:36Z) - Mastering the Task of Open Information Extraction with Large Language Models and Consistent Reasoning Environment [52.592199835286394]
Open Information Extraction (OIE) aims to extract objective structured knowledge from natural texts.
Large language models (LLMs) have exhibited remarkable in-context learning capabilities.
arXiv Detail & Related papers (2023-10-16T17:11:42Z)