Internalizing World Models via Self-Play Finetuning for Agentic RL
- URL: http://arxiv.org/abs/2510.15047v1
- Date: Thu, 16 Oct 2025 18:03:39 GMT
- Title: Internalizing World Models via Self-Play Finetuning for Agentic RL
- Authors: Shiqi Chen, Tongyao Zhu, Zian Wang, Jinghan Zhang, Kangrui Wang, Siyang Gao, Teng Xiao, Yee Whye Teh, Junxian He, Manling Li
- Abstract summary: Large Language Models (LLMs) as agents often struggle in out-of-distribution (OOD) scenarios. We show how to encode this world model by decomposing it into two components: state representation and transition modeling. We introduce SPA, a simple reinforcement learning framework that cold-starts the policy via a Self-Play supervised finetuning stage to learn the world model.
- Score: 65.96875390986655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) as agents often struggle in out-of-distribution (OOD) scenarios. Real-world environments are complex and dynamic, governed by task-specific rules and stochasticity, which makes it difficult for LLMs to ground their internal knowledge in those dynamics. Under such OOD conditions, vanilla RL training often fails to scale; we observe that Pass@k, the probability that at least one of k sampled trajectories succeeds, drops markedly across training steps, indicating brittle exploration and limited generalization. Inspired by model-based reinforcement learning, we hypothesize that equipping LLM agents with an internal world model can better align reasoning with environmental dynamics and improve decision-making. We show how to encode this world model by decomposing it into two components: state representation and transition modeling. Building on this, we introduce SPA, a simple reinforcement learning framework that cold-starts the policy via a Self-Play supervised finetuning (SFT) stage to learn the world model by interacting with the environment, then uses it to simulate future states prior to policy optimization. This simple initialization outperforms the online world-modeling baseline and greatly boosts RL-based agent training performance. Experiments across diverse environments such as Sokoban, FrozenLake, and Sudoku show that our approach significantly improves performance. For example, SPA boosts the Sokoban success rate from 25.6% to 59.8% and raises the FrozenLake score from 22.1% to 70.9% for the Qwen2.5-1.5B-Instruct model.
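The abstract defines Pass@k as the probability that at least one of k sampled trajectories succeeds. A common way to compute this from n rollouts per task, of which c succeeded, is the unbiased combinatorial estimator popularized in code-generation evaluation; the sketch below assumes that standard formulation, not necessarily the paper's exact evaluation script:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator.

    Given n sampled trajectories for a task, of which c succeeded,
    returns the probability that at least one success appears among
    k trajectories drawn without replacement:
        Pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer failures than k: any k-subset must contain a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 rollouts, 2 successes, k = 2
# Pass@2 = 1 - C(2,2)/C(4,2) = 1 - 1/6 = 5/6
```

Tracking this quantity over training steps, as the authors do, exposes whether exploration is collapsing even when Pass@1 appears stable.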
Related papers
- Reinforcement World Model Learning for LLM-based Agents [60.65003139516272]
Reinforcement World Model Learning (RWML) is a self-conditioned method that learns action-supervised world models for LLM-based agents. Our method aligns simulated next states produced by the model with realized next states observed from the environment. We evaluate our method on ALFWorld and $2$ Bench and observe significant gains over the base model, despite being entirely self-supervised.
arXiv Detail & Related papers (2026-02-05T16:30:08Z) - CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning [7.780242426487376]
We propose Chunked RL, a novel reinforcement learning framework for Vision-Language-Action (VLA) models. Within this framework, we extend temporal difference (TD) learning to incorporate action chunking, a prominent characteristic of VLA models. We then propose CO-RFT, an algorithm aimed at fine-tuning VLA models using a limited set of demonstrations.
arXiv Detail & Related papers (2025-08-04T09:11:48Z) - SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model [88.04128601981145]
We introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. SimuRA overcomes the limitations of autoregressive reasoning by introducing a world model for planning via simulation. World-model-based planning, in particular, shows a consistent advantage of up to 124% over autoregressive planning.
arXiv Detail & Related papers (2025-07-31T17:57:20Z) - RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning [125.96848846966087]
Training large language models (LLMs) as interactive agents presents unique challenges. While reinforcement learning has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO, a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents.
arXiv Detail & Related papers (2025-04-24T17:57:08Z) - WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model [55.276852838877346]
Self-evolving agents are trained on trajectories sampled autonomously based on their own policies. We propose a novel framework that introduces a co-evolving World Model LLM. This world model predicts the next observation based on the current observation and action within the web environment.
arXiv Detail & Related papers (2025-04-23T02:54:31Z) - WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents [55.64361927346957]
We propose a training-free "world alignment" that learns an environment's symbolic knowledge complementary to large language models (LLMs). We also propose an RL-free, model-based agent, "WALL-E 2.0", built on the model-predictive control framework. WALL-E 2.0 significantly outperforms existing methods on open-world challenges in Mars (Minecraft-like) and ALFWorld (embodied indoor environments).
arXiv Detail & Related papers (2025-04-22T10:58:27Z) - Accelerating Model-Based Reinforcement Learning with State-Space World Models [18.71404724458449]
Reinforcement learning (RL) is a powerful approach for robot learning. However, model-free RL (MFRL) requires a large number of environment interactions to learn successful control policies. We propose a new method for accelerating model-based RL using state-space world models.
arXiv Detail & Related papers (2025-02-27T15:05:25Z) - SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z) - On the Feasibility of Cross-Task Transfer with Model-Based Reinforcement Learning [45.73223325256312]
We investigate whether internal models learned by modern model-based RL algorithms can be leveraged to solve new, distinctly different tasks faster.
We propose Model-Based Cross-Task Transfer (XTRA), a framework for sample-efficient online RL with scalable pretraining and finetuning of learned world models.
arXiv Detail & Related papers (2022-10-19T17:57:06Z) - Efficiently Training On-Policy Actor-Critic Networks in Robotic Deep Reinforcement Learning with Demonstration-like Sampled Exploration [7.930709072852582]
We propose a generic framework for Learning from Demonstration (LfD) based on actor-critic algorithms.
We conduct experiments on 4 standard benchmark environments in Mujoco and 2 self-designed robotic environments.
arXiv Detail & Related papers (2021-09-27T12:42:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.