Related papers: Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

URL: http://arxiv.org/abs/2509.22963v2
Date: Wed, 01 Oct 2025 00:48:42 GMT
Title: Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces
Authors: Haitong Ma, Ofir Nabati, Aviv Rosenberg, Bo Dai, Oran Lang, Idan Szpektor, Craig Boutilier, Na Li, Shie Mannor, Lior Shani, Guy Tenneholtz,
Abstract summary: Reinforcement learning (RL) struggles to scale to large, action spaces common in many real-world problems.<n>This paper introduces a novel framework for training discrete diffusion models as highly effective policies in complex settings.
Score: 57.466101098183884
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) struggles to scale to large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By leveraging policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we frame the policy update as a distributional matching problem, training the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and significantly enhances training performance. Our method achieves state-of-the-art results and superior sample efficiency across a diverse set of challenging combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems. Experiments demonstrate that our diffusion policies attain superior performance compared to other baselines.

Related papers

Dichotomous Diffusion Policy Optimization [46.51375996317989]
DIPOLE is a novel RL algorithm designed for stable and controllable diffusion policy optimization.<n>We also use DIPOLE to train a large vision-language-action model for end-to-end autonomous driving.
arXiv Detail & Related papers (2025-12-31T16:56:56Z)
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE.<n>This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
arXiv Detail & Related papers (2025-12-01T07:45:39Z)
Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning [0.0]
We propose a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples.<n>Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability.<n>We extend SSCP to goal-conditioned RL, enabling flat policies to exploit subgoal structures without explicit hierarchical inference.
arXiv Detail & Related papers (2025-06-26T16:09:53Z)
Diffusion Policy Policy Optimization [37.04382170999901]
Diffusion Policy Optimization, DPPO, is an algorithmic framework for fine-tuning diffusion-based policies.<n>DPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks.<n>We show that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization.
arXiv Detail & Related papers (2024-09-01T02:47:50Z)
Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient [26.675822002049372]
Deep Diffusion Policy Gradient (DDiffPG) is a novel actor-critic algorithm that learns from scratch multimodal policies. DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective. Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes.
arXiv Detail & Related papers (2024-06-02T09:32:28Z)
Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.<n>We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO)<n>Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.<n>We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation. We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
Policy Representation via Diffusion Probability Model for Reinforcement Learning [67.56363353547775]
We build a theoretical foundation of policy representation via the diffusion probability model. We present a convergence guarantee for diffusion policy, which provides a theory to understand the multimodality of diffusion policy. We propose the DIPO which is an implementation for model-free online RL with DIffusion POlicy.
arXiv Detail & Related papers (2023-05-22T15:23:41Z)
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset. We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy. We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.