Related papers: Bootstrap Off-policy with World Model

Bootstrap Off-policy with World Model

URL: http://arxiv.org/abs/2511.00423v1
Date: Sat, 01 Nov 2025 06:33:04 GMT
Title: Bootstrap Off-policy with World Model
Authors: Guojian Zhan, Likun Wang, Xiangteng Zhang, Jiaxin Gao, Masayoshi Tomizuka, Shengbo Eben Li,
Abstract summary: We propose BOOM, a framework that tightly integrates planning and off-policy learning through a bootstrap loop.<n>BOOM achieves state-of-the-art results in both training stability and final performance.
Score: 59.129118672069644
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Online planning has proven effective in reinforcement learning (RL) for improving sample efficiency and final performance. However, using planning for environment interaction inevitably introduces a divergence between the collected data and the policy's actual behaviors, degrading both model learning and policy improvement. To address this, we propose BOOM (Bootstrap Off-policy with WOrld Model), a framework that tightly integrates planning and off-policy learning through a bootstrap loop: the policy initializes the planner, and the planner refines actions to bootstrap the policy through behavior alignment. This loop is supported by a jointly learned world model, which enables the planner to simulate future trajectories and provides value targets to facilitate policy improvement. The core of BOOM is a likelihood-free alignment loss that bootstraps the policy using the planner's non-parametric action distribution, combined with a soft value-weighted mechanism that prioritizes high-return behaviors and mitigates variability in the planner's action quality within the replay buffer. Experiments on the high-dimensional DeepMind Control Suite and Humanoid-Bench show that BOOM achieves state-of-the-art results in both training stability and final performance. The code is accessible at https://github.com/molumitu/BOOM_MBRL.

Related papers

IPD: Boosting Sequential Policy with Imaginary Planning Distillation in Offline Reinforcement Learning [13.655904209137006]
We propose textbfImaginary Planning Distillation (IPD), a novel framework that seamlessly incorporates offline planning into data generation, supervised training, and online inference.<n>Our framework first learns a world model equipped with uncertainty measures and a quasi-optimal value function from the offline data.<n>By replacing the conventional, manually-tuned return-to-go with the learned quasi-optimal value function, IPD improves both decision-making stability and performance during inference.
arXiv Detail & Related papers (2026-03-04T17:05:39Z)
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO)<n>BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization.<n>On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
arXiv Detail & Related papers (2025-10-21T12:55:04Z)
A KL-regularization framework for learning to plan with adaptive priors [1.0246259631050245]
We introduce Policy Optimization-Model Predictive Control (PO-MPC)<n>PO-MPC integrates the planner's action distribution as a prior in policy optimization.<n>Our experiments show that these extended configurations yield significant performance improvements.
arXiv Detail & Related papers (2025-10-05T16:45:38Z)
Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling reasoning capabilities of Large Language Models.<n>We propose a tractable computational framework that tracks and leverages curvature information during policy updates.<n>The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
arXiv Detail & Related papers (2025-10-01T12:29:32Z)
Behavior Preference Regression for Offline Reinforcement Learning [0.0]
offline reinforcement learning (RL) methods aim to learn optimal policies with access only to trajectories in a fixed dataset.<n>Policy constraint methods formulate policy learning as an optimization problem that balances maximizing reward with minimizing deviation from the policy.<n>We adapt this approach of paired comparison to Behavior Regression Preference for offline RL.<n>We empirically evaluate BPR on the widely used D4RL Locomotion and Antmaze datasets, as well as the more challenging V-D4RL suite.
arXiv Detail & Related papers (2025-03-02T15:13:02Z)
Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning [9.341618348621662]
We aim to find the best-performing policy within a limited budget of online interactions. We first study the major online RL exploration methods based on intrinsic rewards and UCB. We then introduce an algorithm for planning to go out-of-distribution that avoids these issues.
arXiv Detail & Related papers (2023-10-09T13:47:05Z)
A Unified Framework for Alternating Offline Model Training and Policy Learning [62.19209005400561]
In offline model-based reinforcement learning, we learn a dynamic model from historically collected data, and utilize the learned model and fixed datasets for policy learning. We develop an iterative offline MBRL framework, where we maximize a lower bound of the true expected return. With the proposed unified model-policy learning framework, we achieve competitive performance on a wide range of continuous-control offline reinforcement learning datasets.
arXiv Detail & Related papers (2022-10-12T04:58:51Z)
MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost ineffective or impractical. We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization. Our framework discovers novel and high quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z)
Learning Off-Policy with Online Planning [18.63424441772675]
We investigate a novel instantiation of H-step lookahead with a learned model and a terminal value function. We show the flexibility of LOOP to incorporate safety constraints during deployment with a set of navigation environments.
arXiv Detail & Related papers (2020-08-23T16:18:44Z)
MOPO: Model-based Offline Policy Optimization [183.6449600580806]
offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. We show that an existing model-based RL algorithm already produces significant gains in the offline setting. We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.