Related papers: Diffusion Actor-Critic with Entropy Regulator

URL: http://arxiv.org/abs/2405.15177v4
Date: Thu, 10 Oct 2024 02:38:46 GMT
Title: Diffusion Actor-Critic with Entropy Regulator
Authors: Yinuo Wang, Likun Wang, Yuxuan Jiang, Wenjun Zou, Tong Liu, Xujie Song, Wenxuan Wang, Liming Xiao, Jiang Wu, Jingliang Duan, Shengbo Eben Li
Abstract summary: We propose an online RL algorithm termed diffusion actor-critic with entropy regulator (DACER). This algorithm conceptualizes the reverse process of the diffusion model as a novel policy function. Experiments on MuJoCo benchmarks and a multimodal task demonstrate that the DACER algorithm achieves state-of-the-art (SOTA) performance.
Score: 32.79341490514616
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Reinforcement learning (RL) has proven highly effective in addressing complex decision-making and control tasks. However, in most traditional RL algorithms, the policy is typically parameterized as a diagonal Gaussian distribution with learned mean and variance, which constrains their capability to acquire complex policies. In response to this problem, we propose an online RL algorithm termed diffusion actor-critic with entropy regulator (DACER). This algorithm conceptualizes the reverse process of the diffusion model as a novel policy function and leverages the capability of the diffusion model to fit multimodal distributions, thereby enhancing the representational capacity of the policy. Since the distribution of the diffusion policy lacks an analytical expression, its entropy cannot be determined analytically. To mitigate this, we propose a method to estimate the entropy of the diffusion policy using a Gaussian mixture model. Building on the estimated entropy, we can learn a parameter $\alpha$ that modulates the degree of exploration and exploitation. Parameter $\alpha$ will be employed to adaptively regulate the variance of the added noise, which is applied to the action output by the diffusion model. Experimental trials on MuJoCo benchmarks and a multimodal task demonstrate that the DACER algorithm achieves state-of-the-art (SOTA) performance in most MuJoCo control tasks while exhibiting a stronger representational capacity of the diffusion policy.
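
Illustrative sketch (not the authors' implementation): the abstract describes using the reverse diffusion process as the policy and then adding noise, regulated by $\alpha$, to the resulting action. The NumPy code below shows one way that sampling step could look under loud assumptions: a standard DDPM-style reverse update, a linear noise schedule, a toy stand-in for the learned noise-prediction network, and $\alpha$ used directly as the standard deviation of the added noise. The names `sample_action`, `toy_noise_predictor`, and `linear_beta_schedule` are hypothetical, and the exact scaling used in DACER is not given in the abstract.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    # Assumed linear noise schedule; the paper's schedule may differ.
    return np.linspace(beta_start, beta_end, num_steps)


def toy_noise_predictor(state, action, t):
    # Hypothetical stand-in for a learned network eps_theta(state, action, t).
    # It nudges the action toward tanh(state) so the sketch runs end to end.
    return action - np.tanh(state[: action.shape[0]])


def sample_action(state, action_dim, alpha, rng, num_steps=20,
                  noise_predictor=toy_noise_predictor):
    betas = linear_beta_schedule(num_steps)
    signal_rates = 1.0 - betas
    cumulative_rates = np.cumprod(signal_rates)

    # Start the reverse process from pure Gaussian noise.
    action = rng.standard_normal(action_dim)
    for t in reversed(range(num_steps)):
        eps_hat = noise_predictor(state, action, t)
        # DDPM-style posterior mean of the previous (less noisy) action.
        mean = (action - betas[t] / np.sqrt(1.0 - cumulative_rates[t]) * eps_hat)
        mean = mean / np.sqrt(signal_rates[t])
        # No fresh noise is injected at the final denoising step.
        noise = rng.standard_normal(action_dim) if t > 0 else np.zeros(action_dim)
        action = mean + np.sqrt(betas[t]) * noise

    # Exploration noise whose scale is set by the learned parameter alpha
    # (assumption: alpha acts as the standard deviation).
    action = action + alpha * rng.standard_normal(action_dim)
    # Assumed action bounds of [-1, 1], as is common in MuJoCo tasks.
    return np.clip(action, -1.0, 1.0)
```

For example, `sample_action(np.zeros(3), action_dim=2, alpha=0.1, rng=np.random.default_rng(0))` returns a bounded two-dimensional action; a larger `alpha` spreads sampled actions more widely, which corresponds to more exploration.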

Related papers

Diffusion Policy through Conditional Proximal Policy Optimization [6.836651088754774]
Diffusion policies have shown strong potential in modeling multi-modal behaviors. A key challenge is the difficulty of computing action log-likelihood under the diffusion model. We propose a novel and efficient method to train a diffusion policy in an on-policy setting.
arXiv Detail & Related papers (2026-03-05T04:12:13Z)
A Diffusion Model Framework for Maximum Entropy Reinforcement Learning [32.26181994745642]
We present a modified surrogate objective for MaxEntRL that incorporates diffusion dynamics in a principled way. We find that DiffSAC, DiffPPO and DiffWPO achieve better returns and higher sample efficiency than SAC and PPO.
arXiv Detail & Related papers (2025-12-01T18:59:58Z)
One-Step Flow Policy Mirror Descent [52.31612487608593]
Flow Policy Mirror Descent (FPMD) is an online RL algorithm that enables 1-step sampling during flow policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight flow matching models.
arXiv Detail & Related papers (2025-07-31T15:51:10Z)
Distributional Soft Actor-Critic with Diffusion Policy [12.762838783617658]
This paper proposes a distributional reinforcement learning algorithm called DSAC-D (Distributed Soft Actor Critic with Policy Diffusion) to address the challenges of estimating bias in value functions. The proposed algorithm achieves state-of-the-art (SOTA) performance in all 9 control tasks, with significant suppression of estimation bias and total average return improvement of over 10% compared to existing mainstream algorithms.
arXiv Detail & Related papers (2025-07-02T05:50:10Z)
Iterative Distillation for Reward-Guided Fine-Tuning of Diffusion Models in Biomolecular Design [53.93023688824764]
We address the problem of fine-tuning diffusion models for reward-guided generation in biomolecular design. We propose an iterative distillation-based fine-tuning framework that enables diffusion models to optimize for arbitrary reward functions. Our off-policy formulation, combined with KL divergence minimization, enhances training stability and sample efficiency compared to existing RL-based methods.
arXiv Detail & Related papers (2025-07-01T05:55:28Z)
DIME: Diffusion-Based Maximum Entropy Reinforcement Learning [37.420420953705396]
Maximum entropy reinforcement learning (MaxEnt-RL) has become the standard approach to RL due to its beneficial exploration properties. We propose Diffusion-Based Maximum Entropy RL (DIME) to overcome the intractability of computing the marginal entropy of diffusion-based policies.
arXiv Detail & Related papers (2025-02-04T13:37:14Z)
Sampling from Energy-based Policies using Diffusion [14.542411354617983]
We introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. We show that our approach enhances exploration and captures multimodal behavior in continuous control tasks, addressing key limitations of existing methods.
arXiv Detail & Related papers (2024-10-02T08:09:33Z)
Understanding Reinforcement Learning-Based Fine-Tuning of Diffusion Models: A Tutorial and Review [63.31328039424469]
This tutorial provides a comprehensive survey of methods for fine-tuning diffusion models to optimize downstream reward functions. We explain the application of various RL algorithms, including PPO, differentiable optimization, reward-weighted MLE, value-weighted sampling, and path consistency learning.
arXiv Detail & Related papers (2024-07-18T17:35:32Z)
Equivariant Diffusion Policy [16.52810213171303]
We propose a novel diffusion policy learning method that leverages domain symmetries to obtain better sample efficiency and generalization in the denoising function. We evaluate the method empirically on a set of 12 simulation tasks in MimicGen, and show that it obtains a success rate that is, on average, 21.9% higher than the baseline Diffusion Policy.
arXiv Detail & Related papers (2024-07-01T21:23:26Z)
Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO). Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions. We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
Improved off-policy training of diffusion samplers [93.66433483772055]
We study the problem of training diffusion models to sample from a distribution with an unnormalized density or energy function. We benchmark several diffusion-structured inference methods, including simulation-based variational approaches and off-policy methods. Our results shed light on the relative advantages of existing algorithms while bringing into question some claims from past work.
arXiv Detail & Related papers (2024-02-07T18:51:49Z)
Policy Representation via Diffusion Probability Model for Reinforcement Learning [67.56363353547775]
We build a theoretical foundation of policy representation via the diffusion probability model. We present a convergence guarantee for diffusion policy, which provides a theory to understand the multimodality of diffusion policy. We propose the DIPO which is an implementation for model-free online RL with DIffusion POlicy.
arXiv Detail & Related papers (2023-05-22T15:23:41Z)
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset. We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy. We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z)
Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose implicit distributional actor-critic (IDAC), built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution. We observe that IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.