Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient
- URL: http://arxiv.org/abs/2406.00681v1
- Date: Sun, 2 Jun 2024 09:32:28 GMT
- Title: Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient
- Authors: Zechu Li, Rickmer Krohn, Tao Chen, Anurag Ajay, Pulkit Agrawal, Georgia Chalvatzaki
- Abstract summary: Deep Diffusion Policy Gradient (DDiffPG) is a novel actor-critic algorithm that learns multimodal policies from scratch.
DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective.
Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes.
- Score: 26.675822002049372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep reinforcement learning (RL) algorithms typically parameterize the policy as a deep network that outputs either a deterministic action or a stochastic one modeled as a Gaussian distribution, hence restricting learning to a single behavioral mode. Meanwhile, diffusion models emerged as a powerful framework for multimodal learning. However, the use of diffusion policies in online RL is hindered by the intractability of policy likelihood approximation, as well as the greedy objective of RL methods that can easily skew the policy to a single mode. This paper presents Deep Diffusion Policy Gradient (DDiffPG), a novel actor-critic algorithm that learns from scratch multimodal policies parameterized as diffusion models while discovering and maintaining versatile behaviors. DDiffPG explores and discovers multiple modes through off-the-shelf unsupervised clustering combined with novelty-based intrinsic motivation. DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective, ensuring the improvement of the diffusion policy across all modes. Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes. Empirical studies validate DDiffPG's capability to master multimodal behaviors in complex, high-dimensional continuous control tasks with sparse rewards, also showcasing proof-of-concept dynamic online replanning when navigating mazes with unseen obstacles.
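The mode-specific Q-learning idea can be illustrated with a toy tabular sketch (not the paper's actual algorithm; the chain environment, the trivial clustering, and all names are hypothetical): one critic per discovered mode keeps the greedy backup from collapsing distinct behaviors onto a single action.

```python
import numpy as np

# Toy chain MDP with two behavioral modes: mode 0 always takes action 0,
# mode 1 always takes action 1, and both reach the rewarding final state.
n_states, n_modes = 5, 2
n_actions = n_modes
gamma, lr = 0.9, 0.5

# Hypothetical replay data as (state, action, reward, next_state, mode_id);
# in the paper, mode ids come from unsupervised trajectory clustering.
batch = [(s, m, 1.0 if s == n_states - 1 else 0.0, min(s + 1, n_states - 1), m)
         for m in range(n_modes) for s in range(n_states)]

# One critic per mode: Q[m] is trained only on mode-m transitions, so the
# greedy max cannot skew both modes toward whichever action looks better first.
Q = np.zeros((n_modes, n_states, n_actions))
for _ in range(200):
    for s, a, r, s2, m in batch:          # the "multimodal training batch"
        target = r + gamma * Q[m, s2].max()
        Q[m, s, a] += lr * (target - Q[m, s, a])
```

After training, each mode's critic prefers that mode's own action, which is the property the multimodal batch is meant to preserve.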
Related papers
- Diffusion Policy through Conditional Proximal Policy Optimization [6.836651088754774]
Diffusion policies have shown strong potential in modeling multi-modal behaviors. A key challenge is the difficulty of computing action log-likelihoods under the diffusion model. We propose a novel and efficient method to train a diffusion policy in an on-policy setting.
arXiv Detail & Related papers (2026-03-05T04:12:13Z)
- From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z)
- Offline Reinforcement Learning with Generative Trajectory Policies [6.501269050121785]
Generative models have emerged as a powerful class of policies for offline reinforcement learning. Existing methods face a stark trade-off: slow, iterative models are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance. We propose Generative Trajectory Policies (GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE.
arXiv Detail & Related papers (2025-10-13T15:06:28Z)
- Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces [57.466101098183884]
Reinforcement learning (RL) struggles to scale to the large combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in complex settings.
arXiv Detail & Related papers (2025-09-26T21:53:36Z)
- Categorical Policies: Multimodal Policy Learning and Exploration in Continuous Control [1.7495213911983414]
We introduce Categorical Policies to model multimodal behavior modes with an intermediate categorical distribution. By utilizing a latent categorical distribution to select the behavior mode, our approach naturally expresses multimodality while remaining fully differentiable via sampling tricks. Our results indicate that the categorical distribution serves as a powerful tool for structured exploration and multimodal behavior representation in continuous control.
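A minimal sketch of the latent-categorical idea (the mode count, logits, and per-mode means are hypothetical, and the Gumbel-softmax relaxation stands in for whatever sampling trick the paper uses):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=0.5):
    """Differentiable relaxation of a categorical sample."""
    g = -np.log(-np.log(rng.uniform(1e-12, 1.0, size=logits.shape)))  # Gumbel noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

# Hypothetical 3-mode policy head: each latent mode owns one action mean.
mode_logits = np.zeros(3)                       # learned mode preferences
mode_means = np.array([-1.0, 0.0, 1.0])         # learned per-mode actions

def sample_action():
    w = gumbel_softmax(mode_logits)             # soft one-hot mode selection
    return float(w @ mode_means)                # mode-weighted (relaxed) action

actions = np.array([sample_action() for _ in range(500)])
# The sampled actions cluster around all three mode means, not one.
```

Because the relaxed sample is a weighted average of mode parameters, gradients flow through the mode choice, which is what keeps the whole policy differentiable.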
arXiv Detail & Related papers (2025-08-19T15:18:01Z)
- One-Step Flow Policy Mirror Descent [52.31612487608593]
Flow Policy Mirror Descent (FPMD) is an online RL algorithm that enables 1-step sampling during flow policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight flow matching models.
arXiv Detail & Related papers (2025-07-31T15:51:10Z)
- Distributional Soft Actor-Critic with Diffusion Policy [12.762838783617658]
This paper proposes a distributional reinforcement learning algorithm, Distributional Soft Actor-Critic with Diffusion Policy (DSAC-D), to address estimation bias in value functions. The proposed algorithm achieves state-of-the-art (SOTA) performance in all 9 control tasks, with significant suppression of estimation bias and a total average return improvement of over 10% compared to existing mainstream algorithms.
arXiv Detail & Related papers (2025-07-02T05:50:10Z)
- Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods [4.028503203417233]
We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm for fine-tuning discrete diffusion models over non-differentiable rewards.
Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method.
arXiv Detail & Related papers (2025-02-03T14:20:19Z)
- Learning on One Mode: Addressing Multi-modality in Offline Reinforcement Learning [9.38848713730931]
Offline reinforcement learning seeks to learn optimal policies from static datasets without interacting with the environment.
Existing methods often assume unimodal behaviour policies, leading to suboptimal performance when this assumption is violated.
We propose weighted imitation Learning on One Mode (LOM), a novel approach that focuses on learning from a single, promising mode of the behaviour policy.
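A toy sketch of the one-mode idea (the clustering, return estimates, and dataset are all hypothetical, and LOM's actual weighting is more refined than the hard best-mode selection used here):

```python
import numpy as np

rng = np.random.default_rng(0)

# A bimodal behaviour dataset: the a ≈ +1 mode earns reward, the a ≈ -1 mode
# does not. Cluster actions into modes, score each mode by mean return, and
# imitate only the most promising mode.
acts = np.concatenate([rng.normal(-1.0, 0.1, 500), rng.normal(1.0, 0.1, 500)])
rets = np.where(acts > 0, 1.0, 0.0)            # hypothetical returns

mode_of = (acts > 0).astype(int)               # trivial two-cluster assignment
mode_return = [rets[mode_of == k].mean() for k in (0, 1)]
best = int(np.argmax(mode_return))

policy_action = acts[mode_of == best].mean()   # imitate the chosen mode only
```

Imitating the full dataset would average the two modes toward 0, an action neither mode takes; restricting imitation to one mode avoids that failure.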
arXiv Detail & Related papers (2024-12-04T11:57:36Z)
- Sampling from Energy-based Policies using Diffusion [14.542411354617983]
We introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function.
We show that our approach enhances exploration and captures multimodal behavior in continuous control tasks, addressing key limitations of existing methods.
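The energy-based target can be illustrated on a discretized action axis (a sketch only; the paper's contribution is sampling this density with a diffusion model, and the Q-function and temperature here are hypothetical):

```python
import numpy as np

# Target density pi(a|s) ∝ exp(Q(s,a) / alpha), i.e. energy E(a) = -Q(s,a).
# The paper trains a diffusion sampler for this; here we only discretize
# the action axis to show that the target itself is multimodal.
alpha = 0.1
a = np.linspace(-2.0, 2.0, 401)
Q = -((a - 1.0) ** 2) * ((a + 1.0) ** 2)   # hypothetical bimodal Q, peaks at a = ±1

p = np.exp(Q / alpha)
p /= p.sum()

peaks = a[p > 0.5 * p.max()]               # both modes survive in the target
```

A Gaussian policy fit to this target would have to pick one peak or straddle both; an expressive sampler can represent the full Boltzmann distribution.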
arXiv Detail & Related papers (2024-10-02T08:09:33Z)
- DiffPoGAN: Diffusion Policies with Generative Adversarial Networks for Offline Reinforcement Learning [22.323173093804897]
Offline reinforcement learning can learn optimal policies from pre-collected offline datasets without interacting with the environment.
Recent works address this issue by employing generative adversarial networks (GANs).
Inspired by diffusion models, we propose a new offline RL method named Diffusion Policies with Generative Adversarial Networks (DiffPoGAN).
arXiv Detail & Related papers (2024-06-13T13:15:40Z)
- Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.
We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO).
Specifically, we introduce the Q-weighted variational loss, which can be proven to be a tight lower bound of the policy objective in online RL under certain conditions.
We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
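A heavily simplified sketch of the Q-weighting idea (plain weighted regression stands in for the paper's diffusion denoising loss; the critic, weights, and constants are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Reweight a behavior-cloning-style objective by (non-negative) Q-derived
# weights so that high-value actions dominate the gradient; QVPO applies the
# same weighting to the diffusion loss instead of this plain MSE.
actions = rng.normal(size=256)                  # hypothetical action samples
q_values = -(actions - 1.0) ** 2                # hypothetical critic, best at a = 1

w = np.exp(q_values - q_values.max())           # positive, softmax-style weights
w /= w.sum()

theta = 0.0                                     # scalar stand-in for the policy
for _ in range(200):
    grad = (w * 2.0 * (theta - actions)).sum()  # gradient of Q-weighted MSE
    theta -= 0.1 * grad
# theta is pulled toward the high-Q region near a = 1, not the data mean 0.
```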
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
- Learning a Diffusion Model Policy from Rewards via Q-Score Matching [93.0191910132874]
We present a theoretical framework linking the structure of diffusion model policies to a learned Q-function.
We propose a new policy update method from this theory, which we denote Q-score matching.
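The score-matching relationship can be checked on a toy example (assuming a Boltzmann-style link pi ∝ exp(Q/alpha), which for a quadratic, hypothetical Q makes the matching policy an explicit Gaussian; this is an illustration, not the paper's derivation):

```python
import numpy as np

# For pi(a|s) ∝ exp(Q(s,a)/alpha) with a quadratic Q, the matching policy is
# a Gaussian whose score function equals grad_a Q / alpha; we verify this
# identity numerically with finite differences.
alpha, mu = 0.5, 1.0

def Q(a):
    return -(a - mu) ** 2                       # hypothetical quadratic critic

def policy_score(a):                            # score of N(mu, alpha / 2)
    return -(a - mu) / (alpha / 2.0)

def q_grad(a, eps=1e-5):                        # numerical grad_a Q
    return (Q(a + eps) - Q(a - eps)) / (2.0 * eps)

a_test = np.linspace(-2.0, 3.0, 11)
gap = np.abs(policy_score(a_test) - q_grad(a_test) / alpha)
```

Matching the policy's score to the Q-gradient sidesteps explicit policy likelihoods, which is what makes the update compatible with diffusion-model policies.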
arXiv Detail & Related papers (2023-12-18T23:31:01Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- MADiff: Offline Multi-agent Learning with Diffusion Models [79.18130544233794]
MADiff is a diffusion-based multi-agent learning framework.
It works as both a decentralized policy and a centralized controller.
Our experiments demonstrate that MADiff outperforms baseline algorithms across various multi-agent learning tasks.
arXiv Detail & Related papers (2023-05-27T02:14:09Z)
- Policy Representation via Diffusion Probability Model for Reinforcement Learning [67.56363353547775]
We build a theoretical foundation of policy representation via the diffusion probability model.
We present a convergence guarantee for diffusion policy, which provides a theory to understand the multimodality of diffusion policy.
We propose DIPO, an implementation of model-free online RL with DIffusion POlicy.
arXiv Detail & Related papers (2023-05-22T15:23:41Z)
- Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new algorithm for policy gradients in TMDPs via a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
- Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
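The two-term objective behind Diffusion-QL can be sketched in one dimension (plain MSE stands in for the conditional-diffusion behavior-cloning term; the dataset, critic, and trade-off weight eta are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two coupled terms: stay close to the dataset actions (behavior cloning) and
# maximize Q. The real BC term is a conditional-diffusion denoising loss;
# this one-dimensional sketch uses a plain MSE instead.
dataset_actions = rng.normal(0.0, 0.1, size=256)  # hypothetical data near a = 0

eta = 1.0                                         # hypothetical trade-off weight
theta = 0.0                                       # scalar stand-in for the policy
for _ in range(500):
    bc_grad = 2.0 * (theta - dataset_actions).mean()  # grad of MSE to data
    q_grad = -2.0 * (theta - 1.0)                     # grad of hypothetical Q at theta
    theta -= 0.05 * (bc_grad - eta * q_grad)          # minimize  BC - eta * Q
# theta settles between the data mode (0) and the critic's optimum (1).
```

The BC term regularizes the policy toward the dataset, which is what keeps value maximization from exploiting out-of-distribution actions in the offline setting.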
arXiv Detail & Related papers (2022-08-12T09:54:11Z)
- Probabilistic Mixture-of-Experts for Efficient Deep Reinforcement Learning [7.020079427649125]
We show that grasping distinguishable skills for tasks with non-unique optima can be essential for further improving an agent's learning efficiency and performance.
We propose a probabilistic mixture-of-experts (PMOE) for multimodal policy, together with a novel gradient estimator for the indifferentiability problem.
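A minimal mixture-of-experts policy sketch (gating and expert parameters are hypothetical; PMOE additionally trains these components end-to-end with its novel gradient estimator):

```python
import numpy as np

rng = np.random.default_rng(0)

# Gating probabilities pick an expert; each expert is a Gaussian over actions.
gate = np.array([0.5, 0.5])                    # hypothetical gating distribution
means = np.array([-1.0, 1.0])                  # hypothetical expert means
stds = np.array([0.1, 0.1])

def sample_action():
    k = rng.choice(len(gate), p=gate)          # select an expert ("skill")
    return rng.normal(means[k], stds[k])

actions = np.array([sample_action() for _ in range(1000)])
# Both skills stay visible as distinct modes in the sampled actions.
```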
arXiv Detail & Related papers (2021-04-19T08:21:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.