DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions
- URL: http://arxiv.org/abs/2509.19538v1
- Date: Tue, 23 Sep 2025 20:06:26 GMT
- Title: DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions
- Authors: Zongyue Li, Xiao Han, Yusong Li, Niklas Strauss, Matthias Schubert,
- Abstract summary: We propose a diffusion-based world model that generates future state-reward trajectories conditioned on the current state, action, and return-to-go.<n>We show that conservative offline RL algorithms such as TD3BC and IQL benefit significantly from training on these augmented trajectories.
- Score: 6.723690093335988
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion-based world models have demonstrated strong capabilities in synthesizing realistic long-horizon trajectories for offline reinforcement learning (RL). However, many existing methods do not directly generate actions alongside states and rewards, limiting their compatibility with standard value-based offline RL algorithms that rely on one-step temporal difference (TD) learning. While prior work has explored joint modeling of states, rewards, and actions to address this issue, such formulations often lead to increased training complexity and reduced performance in practice. We propose \textbf{DAWM}, a diffusion-based world model that generates future state-reward trajectories conditioned on the current state, action, and return-to-go, paired with an inverse dynamics model (IDM) for efficient action inference. This modular design produces complete synthetic transitions suitable for one-step TD-based offline RL, enabling effective and computationally efficient training. Empirically, we show that conservative offline RL algorithms such as TD3BC and IQL benefit significantly from training on these augmented trajectories, consistently outperforming prior diffusion-based baselines across multiple tasks in the D4RL benchmark.
Related papers
- DiRL: An Efficient Post-Training Framework for Diffusion Language Models [54.405206032785706]
Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models.<n>Existing methods suffer from computational inefficiency and objective mismatches between training and inference.<n>We introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference.
arXiv Detail & Related papers (2025-12-23T08:33:19Z) - Model-Based Diffusion Sampling for Predictive Control in Offline Decision Making [48.998030470623384]
offline decision-making requires reliable behaviors from fixed datasets without further interaction.<n>We propose a compositional model-based diffusion framework consisting of: (i) a planner that generates diverse, task-aligned trajectories; (ii) a dynamics model that enforces consistency with the underlying system dynamics; and (iii) a ranker module that selects behaviors aligned with the task objectives.
arXiv Detail & Related papers (2025-12-09T06:26:02Z) - Controllable Flow Matching for Online Reinforcement Learning [5.944099401274571]
We propose CtrlFlow, a trajectory-level synthetic method using conditional flow matching (CFM)<n>Our method ensures optimal trajectory sampling by minimizing the control energy governed by the non-linear Controllability Gramian Matrix.<n>In online settings, CtrlFlow demonstrates the better performance on common MuJoCo benchmark tasks than dynamics models.
arXiv Detail & Related papers (2025-11-10T08:01:20Z) - Offline Reinforcement Learning with Generative Trajectory Policies [6.501269050121785]
Generative models have emerged as a powerful class of policies for offline reinforcement learning.<n>Existing methods face a stark trade-off: slow, iterative models are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance.<n>We propose Generative Trajectory Policies (GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE.
arXiv Detail & Related papers (2025-10-13T15:06:28Z) - Scaling Offline RL via Efficient and Expressive Shortcut Models [13.050231036248338]
offline reinforcement learning (RL) remains challenging due to the iterative nature of their noise sampling processes.<n>We introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that leverages shortcut models to scale both training and inference.<n>We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute.
arXiv Detail & Related papers (2025-05-28T20:59:22Z) - Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models [79.2162092822111]
We systematically evaluate reinforcement learning (RL) and control-based methods on a suite of navigation tasks.<n>We employ a latent dynamics model using the Joint Embedding Predictive Architecture (JEPA) and employ it for planning.<n>Our results show that model-free RL benefits most from large amounts of high-quality data, whereas model-based planning generalizes better to unseen layouts.
arXiv Detail & Related papers (2025-02-20T18:39:41Z) - Are Expressive Models Truly Necessary for Offline RL? [18.425797519857113]
Sequential modeling requires capturing accurate dynamics across long horizons in trajectory data to ensure reasonable policy performance.<n>We show that lightweight models as simple as shallow 2-layers can enjoy accurate dynamics consistency and significantly reduced sequential modeling errors.
arXiv Detail & Related papers (2024-12-15T17:33:56Z) - Large Vision Model-Enhanced Digital Twin with Deep Reinforcement Learning for User Association and Load Balancing in Dynamic Wireless Networks [17.041443813376546]
This paper introduces a large vision model (LVM)-enhanced digital twin (DT) for wireless networks.<n>We propose a parallel DRL method for user association and load balancing in networks with dynamic user counts, distribution, and mobility patterns.<n> Numerical results show that the developed LVM-enhanced DT achieves closely comparable training efficacy to the real environment.
arXiv Detail & Related papers (2024-10-10T04:54:48Z) - Action-Quantized Offline Reinforcement Learning for Robotic Skill
Learning [68.16998247593209]
offline reinforcement learning (RL) paradigm provides recipe to convert static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
arXiv Detail & Related papers (2023-10-18T06:07:10Z) - Model-Based Reinforcement Learning with Multi-Task Offline Pretraining [59.82457030180094]
We present a model-based RL method that learns to transfer potentially useful dynamics and action demonstrations from offline data to a novel task.
The main idea is to use the world models not only as simulators for behavior learning but also as tools to measure the task relevance.
We demonstrate the advantages of our approach compared with the state-of-the-art methods in Meta-World and DeepMind Control Suite.
arXiv Detail & Related papers (2023-06-06T02:24:41Z) - Making Offline RL Online: Collaborative World Models for Offline Visual Reinforcement Learning [93.99377042564919]
This paper tries to build more flexible constraints for value estimation without impeding the exploration of potential advantages.
The key idea is to leverage off-the-shelf RL simulators, which can be easily interacted with in an online manner, as the "test bed" for offline policies.
We introduce CoWorld, a model-based RL approach that mitigates cross-domain discrepancies in state and reward spaces.
arXiv Detail & Related papers (2023-05-24T15:45:35Z) - Backward Imitation and Forward Reinforcement Learning via Bi-directional
Model Rollouts [11.4219428942199]
Traditional model-based reinforcement learning (RL) methods generate forward rollout traces using the learnt dynamics model.
In this paper, we propose the backward imitation and forward reinforcement learning (BIFRL) framework.
BIFRL empowers the agent to both reach to and explore from high-value states in a more efficient manner.
arXiv Detail & Related papers (2022-08-04T04:04:05Z) - POAR: Efficient Policy Optimization via Online Abstract State
Representation Learning [6.171331561029968]
State Representation Learning (SRL) is proposed to specifically learn to encode task-relevant features from complex sensory data into low-dimensional states.
We introduce a new SRL prior called domain resemblance to leverage expert demonstration to improve SRL interpretations.
We empirically verify POAR to efficiently handle tasks in high dimensions and facilitate training real-life robots directly from scratch.
arXiv Detail & Related papers (2021-09-17T16:52:03Z) - Behavioral Priors and Dynamics Models: Improving Performance and Domain
Transfer in Offline RL [82.93243616342275]
We introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE)
MABE is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary.
In experiments that require cross-domain generalization, we find that MABE outperforms prior methods.
arXiv Detail & Related papers (2021-06-16T20:48:49Z) - Offline Reinforcement Learning from Images with Latent Space Models [60.69745540036375]
offline reinforcement learning (RL) refers to the problem of learning policies from a static dataset of environment interactions.
We build on recent advances in model-based algorithms for offline RL, and extend them to high-dimensional visual observation spaces.
Our approach is both tractable in practice and corresponds to maximizing a lower bound of the ELBO in the unknown POMDP.
arXiv Detail & Related papers (2020-12-21T18:28:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.