Don't Start from Scratch: Behavioral Refinement via Interpolant-based Policy Diffusion
- URL: http://arxiv.org/abs/2402.16075v4
- Date: Thu, 11 Jul 2024 03:41:42 GMT
- Title: Don't Start from Scratch: Behavioral Refinement via Interpolant-based Policy Diffusion
- Authors: Kaiqi Chen, Eugene Lim, Kelvin Lin, Yiyang Chen, Harold Soh
- Abstract summary: Diffusion models learn to shape a policy by diffusing actions (or states) from standard Gaussian noise.
The target policy to be learned is often significantly different from a Gaussian, and this mismatch can result in poor performance when only a small number of diffusion steps is used.
We contribute theoretical results, a new method, and empirical findings that show the benefits of using an informative source policy.
- Score: 16.44141792109178
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Imitation learning empowers artificial agents to mimic behavior by learning from demonstrations. Recently, diffusion models, which have the ability to model high-dimensional and multimodal distributions, have shown impressive performance on imitation learning tasks. These models learn to shape a policy by diffusing actions (or states) from standard Gaussian noise. However, the target policy to be learned is often significantly different from Gaussian, and this mismatch can result in poor performance when using a small number of diffusion steps (to improve inference speed) and under limited data. The key idea in this work is that initiating from a more informative source than Gaussian enables diffusion methods to mitigate the above limitations. We contribute theoretical results, a new method, and empirical findings that show the benefits of using an informative source policy. Our method, which we call BRIDGER, leverages the stochastic interpolants framework to bridge arbitrary policies, thus enabling a flexible approach towards imitation learning. It generalizes prior work in that standard Gaussians can still be applied, but other source policies can be used if available. In experiments on challenging simulation benchmarks and on real robots, BRIDGER outperforms state-of-the-art diffusion policies. We provide further analysis on design considerations when applying BRIDGER. Code for BRIDGER is available at https://github.com/clear-nus/bridger.
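To make the bridging idea concrete, below is a minimal, hedged sketch of interpolant-based action generation. It is not the authors' BRIDGER implementation: it uses a deterministic linear interpolant (BRIDGER's stochastic interpolants also include a noise term), and `velocity_net`, `source_policy`, and the training loop are placeholder assumptions.

```python
# Hedged sketch only: a deterministic linear interpolant between a source
# policy's action and a demonstrated action, with a learned velocity field.
# `velocity_net` and `source_policy` are placeholder callables, not BRIDGER's.
import torch

def train_step(velocity_net, source_policy, obs, a_demo, optimizer):
    """One gradient step: regress the interpolant's velocity (a_demo - a_src)."""
    a_src = source_policy(obs)              # action proposed by the informative source
    t = torch.rand(a_demo.shape[0], 1)      # interpolation time in [0, 1]
    x_t = (1 - t) * a_src + t * a_demo      # linear interpolant between source and demo
    target_v = a_demo - a_src               # d x_t / dt for this interpolant
    loss = ((velocity_net(obs, x_t, t) - target_v) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def generate_action(velocity_net, source_policy, obs, n_steps=5):
    """Integrate the learned velocity field, starting from the source policy's action."""
    a = source_policy(obs)                  # start from the source, not Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((a.shape[0], 1), i * dt)
        a = a + dt * velocity_net(obs, a, t)  # Euler step toward the target policy
    return a
```

The point of the sketch is the starting point of `generate_action`: integration begins at the source policy's action rather than at Gaussian noise, so a small number of steps can already land near the target policy.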
Related papers
- Latent Weight Diffusion: Generating Policies from Trajectories [13.404962654299041]
Latent Weight Diffusion (LWD) is a method that uses diffusion to learn a distribution over policies for robotic tasks, rather than over trajectories.
LWD achieves a higher success rate compared to a vanilla multi-task policy, while using models up to 18x smaller during inference.
arXiv Detail & Related papers (2024-10-17T21:30:29Z) - Equivariant Diffusion Policy [16.52810213171303]
We propose a novel diffusion policy learning method that leverages domain symmetries to obtain better sample efficiency and generalization in the denoising function.
We evaluate the method empirically on a set of 12 simulation tasks in MimicGen, and show that it obtains a success rate that is, on average, 21.9% higher than the baseline Diffusion Policy.
arXiv Detail & Related papers (2024-07-01T21:23:26Z) - Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient [26.675822002049372]
Deep Diffusion Policy Gradient (DDiffPG) is a novel actor-critic algorithm that learns multimodal policies from scratch.
DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective.
Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes.
arXiv Detail & Related papers (2024-06-02T09:32:28Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Learning a Diffusion Model Policy from Rewards via Q-Score Matching [93.0191910132874]
We present a theoretical framework linking the structure of diffusion model policies to a learned Q-function.
We propose a new policy update method from this theory, which we denote Q-score matching.
arXiv Detail & Related papers (2023-12-18T23:31:01Z) - Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z) - Policy Representation via Diffusion Probability Model for Reinforcement Learning [67.56363353547775]
We build a theoretical foundation of policy representation via the diffusion probability model.
We present a convergence guarantee for diffusion policy, which provides a theory to understand the multimodality of diffusion policy.
We propose DIPO, an implementation of model-free online RL with a DIffusion POlicy.
arXiv Detail & Related papers (2023-05-22T15:23:41Z) - Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL), which uses a conditional diffusion model to represent the policy (a hedged sketch of this objective follows the related-papers list).
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z) - Strictly Batch Imitation Learning by Energy-based Distribution Matching [104.33286163090179]
Consider learning a policy purely on the basis of demonstrated behavior -- that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment.
One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting.
But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient.
We argue that a good solution should be able to explicitly parameterize a policy, implicitly learn from rollout dynamics, and operate in an entirely offline fashion.
arXiv Detail & Related papers (2020-06-25T03:27:59Z)
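As referenced in the Diffusion-QL entry above, the following is a minimal, hedged sketch of an objective that combines a denoising behavior-cloning term with Q-guidance. It is not the authors' code; `diffusion_policy.denoising_loss`, `diffusion_policy.sample`, `q_net`, and `alpha` are placeholder assumptions standing in for the paper's components.

```python
# Hedged sketch of a Diffusion-QL-style policy objective. All attributes and
# arguments are placeholders; tensors are assumed to be PyTorch tensors.
def diffusion_ql_policy_loss(diffusion_policy, q_net, obs, a_data, alpha=1.0):
    # 1) Behavior-cloning term: standard denoising loss on dataset actions.
    bc_loss = diffusion_policy.denoising_loss(obs, a_data)

    # 2) Q-guidance term: run the (differentiable) reverse chain to sample an
    #    action and push it toward higher estimated value.
    a_pi = diffusion_policy.sample(obs)
    q_loss = -q_net(obs, a_pi).mean()

    # Stay close to the offline data while improving the estimated Q-value.
    return bc_loss + alpha * q_loss
```

The `alpha` weight trades off staying close to the offline dataset against maximizing the learned Q-function; in practice such a weight is typically tuned per task or normalized by the Q-value scale, which this sketch omits.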
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.