ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning
- URL: http://arxiv.org/abs/2510.12693v1
- Date: Tue, 14 Oct 2025 16:25:46 GMT
- Title: ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning
- Authors: Hanyang Chen, Mark Zhao, Rui Yang, Qinwei Ma, Ke Yang, Jiarui Yao, Kangrui Wang, Hao Bai, Zhenhailong Wang, Rui Pan, Mengchao Zhang, Jose Barreiros, Aykut Onol, ChengXiang Zhai, Heng Ji, Manling Li, Huan Zhang, Tong Zhang,
- Abstract summary: We present textitEmbodied Reasoning Agent (ERA), a framework that integrates prior knowledge learning and online reinforcement learning.<n>ERA offers a practical path toward scalable embodied intelligence, providing methodological insights for future embodied AI systems.
- Score: 73.35191368656224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in embodied AI highlight the potential of vision language models (VLMs) as agents capable of perception, reasoning, and interaction in complex environments. However, top-performing systems rely on large-scale models that are costly to deploy, while smaller VLMs lack the necessary knowledge and skills to succeed. To bridge this gap, we present \textit{Embodied Reasoning Agent (ERA)}, a two-stage framework that integrates prior knowledge learning and online reinforcement learning (RL). The first stage, \textit{Embodied Prior Learning}, distills foundational knowledge from three types of data: (1) Trajectory-Augmented Priors, which enrich existing trajectory data with structured reasoning generated by stronger models; (2) Environment-Anchored Priors, which provide in-environment knowledge and grounding supervision; and (3) External Knowledge Priors, which transfer general knowledge from out-of-environment datasets. In the second stage, we develop an online RL pipeline that builds on these priors to further enhance agent performance. To overcome the inherent challenges in agent RL, including long horizons, sparse rewards, and training instability, we introduce three key designs: self-summarization for context management, dense reward shaping, and turn-level policy optimization. Extensive experiments on both high-level planning (EB-ALFRED) and low-level control (EB-Manipulation) tasks demonstrate that ERA-3B surpasses both prompting-based large models and previous training-based baselines. Specifically, it achieves overall improvements of 8.4\% on EB-ALFRED and 19.4\% on EB-Manipulation over GPT-4o, and exhibits strong generalization to unseen tasks. Overall, ERA offers a practical path toward scalable embodied intelligence, providing methodological insights for future embodied AI systems.
Related papers
- Scaling Reinforcement Learning for Content Moderation with Large Language Models [16.516137166093696]
We present a comprehensive empirical investigation of scaling reinforcement learning for content classification.<n>We show that RL substantially improves performance on tasks requiring complex policy-grounded reasoning.
arXiv Detail & Related papers (2025-12-23T05:27:16Z) - On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models [73.10315509190623]
Recent reinforcement learning techniques have yielded impressive reasoning improvements in language models.<n>It remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training.<n>We develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training.
arXiv Detail & Related papers (2025-12-08T18:12:10Z) - Demystifying Reinforcement Learning in Agentic Reasoning [90.3737088727791]
We conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning.<n>We highlight our key insights: (i) replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT.<n> Exploration-friendly techniques are crucial for agentic RL, such as clip higher, overlong reward shaping, and maintaining adequate policy entropy could improve the training efficiency.
arXiv Detail & Related papers (2025-10-13T17:57:15Z) - Agentic-KGR: Co-evolutionary Knowledge Graph Construction through Multi-Agent Reinforcement Learning [6.665920297143511]
Agentic-KGR is a novel framework enabling co-evolution between large language models (LLMs) and knowledge graphs (KGs)<n>Our approach introduces three key innovations: (1) a dynamic schema expansion mechanism that systematically extends graph beyond pre-defined boundaries during training; (2) a retrieval-augmented memory system enabling synergistic co-evolution between model parameters and knowledge structures through continuous optimization; and (3) a learnable multi-scale prompt compression approach that preserves critical information while reducing computational complexity through adaptive sequence optimization.
arXiv Detail & Related papers (2025-10-10T09:00:07Z) - Reinforcement Learning on Pre-Training Data [55.570379963147424]
We introduce Reinforcement Learning on Pre-Training data (R), a new training-time scaling paradigm for optimizing large language models (LLMs)<n>R enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL)<n>Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of R.
arXiv Detail & Related papers (2025-09-23T17:10:40Z) - Agentic Episodic Control [16.94652073521156]
Reinforcement learning (RL) has driven breakthroughs in AI, from game-play to scientific discovery and AI alignment.<n>Recent advances suggest that large language models, with their rich world knowledge and reasoning capabilities, could complement RL by enabling semantic state modeling and task-agnostic planning.<n>We propose the Agentic Episodic Control (AEC), a novel architecture that integrates RL with large language models to enhance decision-making.
arXiv Detail & Related papers (2025-06-02T08:57:37Z) - Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model [39.58344147240552]
We investigate whether large vision-language models (VLMs) can compose capabilities across modalities or tasks under out-of-distribution conditions.<n>Our findings shed light on the current limitations of RL-based reasoning VLM training and provide actionable insights toward building models that reason compositionally across modalities and tasks.
arXiv Detail & Related papers (2025-05-26T01:42:38Z) - LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.10969986056]
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications.<n>Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z) - UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent [14.089700378708756]
We introduce textbfUP-VLA, a textbfUnified VLA model training with both multi-modal textbfUnderstanding and future textbfPrediction objectives.<n>UP-VLA achieves a 33% improvement on the Calvin ABC-D benchmark compared to the previous state-of-the-art method.
arXiv Detail & Related papers (2025-01-31T03:20:09Z) - Human-Timescale Adaptation in an Open-Ended Task Space [56.55530165036327]
We show that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans.
Our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains.
arXiv Detail & Related papers (2023-01-18T15:39:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.