Related papers: Sampling from Energy-based Policies using Diffusion

Sampling from Energy-based Policies using Diffusion

URL: http://arxiv.org/abs/2410.01312v1
Date: Wed, 2 Oct 2024 08:09:33 GMT
Title: Sampling from Energy-based Policies using Diffusion
Authors: Vineet Jain, Tara Akhound-Sadegh, Siamak Ravanbakhsh,
Abstract summary: We introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. We show that our approach enhances exploration and captures multimodal behavior in continuous control tasks, addressing key limitations of existing methods.
Score: 14.542411354617983
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Energy-based policies offer a flexible framework for modeling complex, multimodal behaviors in reinforcement learning (RL). In maximum entropy RL, the optimal policy is a Boltzmann distribution derived from the soft Q-function, but direct sampling from this distribution in continuous action spaces is computationally intractable. As a result, existing methods typically use simpler parametric distributions, like Gaussians, for policy representation - limiting their ability to capture the full complexity of multimodal action distributions. In this paper, we introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. Based on this approach, we propose an actor-critic method called Diffusion Q-Sampling (DQS) that enables more expressive policy representations, allowing stable learning in diverse environments. We show that our approach enhances exploration and captures multimodal behavior in continuous control tasks, addressing key limitations of existing methods.

Related papers

Diffusion Policy through Conditional Proximal Policy Optimization [6.836651088754774]
diffusion policies have shown strong potential in modeling multi-modal behaviors.<n>A key challenge is the difficulty of computing action log-likelihood under the diffusion model.<n>We propose a novel and efficient method to train a diffusion policy in an on-policy setting.
arXiv Detail & Related papers (2026-03-05T04:12:13Z)
Categorical Policies: Multimodal Policy Learning and Exploration in Continuous Control [1.7495213911983414]
We introduce Categorical Policies to model multimodal behavior modes with an intermediate categorical distribution.<n>By utilizing a latent categorical distribution to select the behavior mode, our approach naturally expresses multimodality while remaining fully differentiable via the sampling tricks.<n>Our results indicate that the Categorical distribution serves as a powerful tool for structured exploration and multimodal behavior representation in continuous control.
arXiv Detail & Related papers (2025-08-19T15:18:01Z)
One-Step Flow Policy Mirror Descent [38.39095131927252]
Flow Policy Mirror Descent (FPMD) is an online RL algorithm that enables 1-step sampling during policy inference.<n>Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight flow matching models.
arXiv Detail & Related papers (2025-07-31T15:51:10Z)
Distributional Soft Actor-Critic with Diffusion Policy [12.762838783617658]
This paper proposes a distributional reinforcement learning algorithm called DSAC-D (Distributed Soft Actor Critic with Policy Diffusion) to address the challenges of estimating bias in value functions.<n>The proposed algorithm achieves state-of-the-art (SOTA) performance in all 9 control tasks, with significant suppression of estimation bias and total average return improvement of over 10% compared to existing mainstream algorithms.
arXiv Detail & Related papers (2025-07-02T05:50:10Z)
Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient [26.675822002049372]
Deep Diffusion Policy Gradient (DDiffPG) is a novel actor-critic algorithm that learns from scratch multimodal policies. DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective. Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes.
arXiv Detail & Related papers (2024-06-02T09:32:28Z)
Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO) Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions. We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
Diffusion Actor-Critic with Entropy Regulator [32.79341490514616]
We propose an online RL algorithm termed diffusion actor-critic with entropy regulator (DACER) This algorithm conceptualizes the reverse process of the diffusion model as a novel policy function. Experiments on MuJoCo benchmarks and a multimodal task demonstrate that the DACER algorithm achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-05-24T03:23:27Z)
Discrete Probabilistic Inference as Control in Multi-path Environments [84.67055173040107]
We consider the problem of sampling from a discrete and structured distribution as a sequential decision problem. We show that GFlowNets learn a policy that samples objects proportionally to their reward by enforcing a conservation of flows. We also prove that some flow-matching objectives found in the GFlowNet literature are in fact equivalent to well-established MaxEnt RL algorithms with a corrected reward.
arXiv Detail & Related papers (2024-02-15T20:20:35Z)
Entropy-regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning [11.0460569590737]
This paper presents advanced techniques of training diffusion policies for offline reinforcement learning (RL) We show that an SDE has a solution that we can use to calculate the log probability of the policy, yielding an entropy regularizer that improves the exploration of offline datasets. By combining the entropy-regularized diffusion policy with Q-ensembles in offline RL, our method achieves state-of-the-art performance on most tasks in D4RL benchmarks.
arXiv Detail & Related papers (2024-02-06T15:34:30Z)
Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions. We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
Policy Representation via Diffusion Probability Model for Reinforcement Learning [67.56363353547775]
We build a theoretical foundation of policy representation via the diffusion probability model. We present a convergence guarantee for diffusion policy, which provides a theory to understand the multimodality of diffusion policy. We propose the DIPO which is an implementation for model-free online RL with DIffusion POlicy.
arXiv Detail & Related papers (2023-05-22T15:23:41Z)
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset. We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy. We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z)
Implicit Distributional Reinforcement Learning [61.166030238490634]
implicit distributional actor-critic (IDAC) built on two deep generator networks (DGNs) Semi-implicit actor (SIA) powered by a flexible policy distribution. We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.