Goal-Conditioned Imitation Learning using Score-based Diffusion Policies
- URL: http://arxiv.org/abs/2304.02532v2
- Date: Thu, 1 Jun 2023 15:18:21 GMT
- Title: Goal-Conditioned Imitation Learning using Score-based Diffusion Policies
- Authors: Moritz Reuss, Maximilian Li, Xiaogang Jia, Rudolf Lioutikov
- Abstract summary: We propose a new policy representation based on score-based diffusion models (SDMs).
We apply our new policy representation in the domain of Goal-Conditioned Imitation Learning (GCIL).
We show how BESO can even be used to learn a goal-independent policy from play data using classifier-free guidance.
- Score: 3.49482137286472
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new policy representation based on score-based diffusion models
(SDMs). We apply our new policy representation in the domain of
Goal-Conditioned Imitation Learning (GCIL) to learn general-purpose
goal-specified policies from large uncurated datasets without rewards. Our new
goal-conditioned policy architecture "$\textbf{BE}$havior generation with
$\textbf{S}$c$\textbf{O}$re-based Diffusion Policies" (BESO) leverages a
generative, score-based diffusion model as its policy. BESO decouples the
learning of the score model from the inference sampling process and hence
allows for fast sampling strategies to generate goal-specified behavior in just
3 denoising steps, compared to the 30+ steps of other diffusion-based policies.
Furthermore, BESO is highly expressive and can effectively capture
multi-modality present in the solution space of the play data. Unlike previous
methods such as Latent Plans or C-Bet, BESO does not rely on complex
hierarchical policies or additional clustering for effective goal-conditioned
behavior learning. Finally, we show how BESO can even be used to learn a
goal-independent policy from play-data using classifier-free guidance. To the
best of our knowledge, this is the first work that a) represents a behavior
policy based on such a decoupled SDM, b) learns an SDM-based policy in the
domain of GCIL, and c) provides a way to simultaneously learn a goal-dependent
and a goal-independent policy from play-data. We evaluate BESO through detailed
simulation and show that it consistently outperforms several state-of-the-art
goal-conditioned imitation learning methods on challenging benchmarks. We
additionally provide extensive ablation studies and experiments to demonstrate
the effectiveness of our method for goal-conditioned behavior generation.
Demonstrations and Code are available at
https://intuitive-robots.github.io/beso-website/
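The abstract gives the high-level recipe: train a goal-conditioned denoising score model on play data, keep that training decoupled from the sampler, and then generate actions with only a few denoising steps, optionally blending the goal-conditioned and unconditional predictions via classifier-free guidance. The sketch below illustrates that recipe in PyTorch. It is a minimal illustration under assumed design choices; the `ScoreNet` architecture, the log-uniform noise schedule, the Euler sampler, the 3-step budget, and the guidance weight `w` are assumptions made for the example, not the authors' implementation.

```python
# Minimal sketch (not the BESO implementation) of a goal-conditioned
# score-based diffusion policy: denoising-score-matching training with
# random goal dropout, plus a short Euler sampler with classifier-free
# guidance. All architecture and schedule choices are illustrative.
import math
import torch
import torch.nn as nn


class ScoreNet(nn.Module):
    """Predicts the denoised action from a noisy action, state, goal, and noise level."""

    def __init__(self, state_dim, goal_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + act_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, noisy_action, state, goal, sigma):
        # A zeroed-out goal plays the role of the unconditional (goal-independent) model.
        return self.net(torch.cat([noisy_action, state, goal, sigma], dim=-1))


def train_step(model, optimizer, state, goal, action,
               sigma_min=0.01, sigma_max=1.0, p_drop=0.1):
    """One denoising-score-matching step; goals are randomly dropped so the same
    network also learns a goal-independent denoiser (needed for guidance)."""
    log_sigma = (torch.rand(action.shape[0], 1)
                 * (math.log(sigma_max) - math.log(sigma_min)) + math.log(sigma_min))
    sigma = log_sigma.exp()
    noisy = action + sigma * torch.randn_like(action)
    drop = (torch.rand(goal.shape[0], 1) < p_drop).float()
    denoised = model(noisy, state, goal * (1.0 - drop), sigma)
    loss = ((denoised - action) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def sample_action(model, state, goal, act_dim,
                  n_steps=3, sigma_max=1.0, sigma_min=0.01, w=1.5):
    """Euler sampler run for only a few denoising steps. w > 1 strengthens goal
    conditioning; w = 0 recovers the purely goal-independent policy."""
    sigmas = torch.exp(torch.linspace(math.log(sigma_max), math.log(sigma_min), n_steps + 1))
    a = torch.randn(state.shape[0], act_dim) * sigmas[0]
    for i in range(n_steps):
        s = sigmas[i].expand(state.shape[0], 1)
        d_cond = model(a, state, goal, s)                      # goal-conditioned denoiser
        d_uncond = model(a, state, torch.zeros_like(goal), s)  # goal-independent denoiser
        denoised = d_uncond + w * (d_cond - d_uncond)          # classifier-free guidance
        # Euler step toward the next noise level along the implied score direction.
        a = a + (sigmas[i + 1] - sigmas[i]) * (a - denoised) / sigmas[i]
    return a
```

In this sketch a single network provides both the goal-conditioned and the goal-independent denoiser, which mirrors the abstract's claim that both policies can be learned simultaneously from play data; BESO's actual noise schedule, sampler, and guidance formulation may differ.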
Related papers
- Probabilistic Subgoal Representations for Hierarchical Reinforcement Learning [16.756888009396462]
In goal-conditioned hierarchical reinforcement learning, a high-level policy specifies a subgoal for the low-level policy to reach.
Existing methods adopt a subgoal representation that provides a deterministic mapping from state space to latent subgoal space.
This paper employs a GP prior on the latent subgoal space to learn a posterior distribution over the subgoal representation functions.
arXiv Detail & Related papers (2024-06-24T15:09:22Z)
- Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences.
Despite its popularity, (fixed) reward models can become inaccurate on off-distribution policy samples.
We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
arXiv Detail & Related papers (2024-03-28T10:02:10Z)
- Don't Start from Scratch: Behavioral Refinement via Interpolant-based Policy Diffusion [16.44141792109178]
Diffusion models learn to shape a policy by diffusing actions (or states) from standard Gaussian noise.
The target policy to be learned is often significantly different from Gaussian and this can result in poor performance when using a small number of diffusion steps.
We contribute theoretical results, a new method, and empirical findings that show the benefits of using an informative source policy.
arXiv Detail & Related papers (2024-02-25T12:19:21Z)
- Learning a Diffusion Model Policy from Rewards via Q-Score Matching [93.0191910132874]
We present a theoretical framework linking the structure of diffusion model policies to a learned Q-function.
We propose a new policy update method from this theory, which we denote Q-score matching.
arXiv Detail & Related papers (2023-12-18T23:31:01Z)
- Language-Conditioned Semantic Search-Based Policy for Robotic Manipulation Tasks [2.1332830068386217]
We propose a language-conditioned semantic search-based method to produce an online search-based policy.
Our approach surpasses the performance of the baselines on the CALVIN benchmark and exhibits strong zero-shot adaptation capabilities.
arXiv Detail & Related papers (2023-12-10T16:17:00Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmark.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- HIQL: Offline Goal-Conditioned RL with Latent States as Actions [81.67963770528753]
We propose a hierarchical algorithm for goal-conditioned RL from offline data.
We show how this hierarchical decomposition makes our method robust to noise in the estimated value function.
Our method can solve long-horizon tasks that stymie prior methods, can scale to high-dimensional image observations, and can readily make use of action-free data.
arXiv Detail & Related papers (2023-07-22T00:17:36Z)
- Comparing the Efficacy of Fine-Tuning and Meta-Learning for Few-Shot Policy Imitation [45.312333134810665]
State-of-the-art methods to tackle few-shot imitation rely on meta-learning.
Recent work has shown that fine-tuners outperform meta-learners in few-shot image classification tasks.
We release an open source dataset called iMuJoCo consisting of 154 variants of popular OpenAI-Gym MuJoCo environments.
arXiv Detail & Related papers (2023-06-23T15:29:15Z)
- Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL), which uses a conditional diffusion model to represent the policy (a sketch of its training objective appears after this list).
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and the performance gap between the truly best policy and the best of the top three ranked policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Contextual Policy Transfer in Reinforcement Learning Domains via Deep Mixtures-of-Experts [24.489002406693128]
We introduce a novel mixture-of-experts formulation for learning state-dependent beliefs over source task dynamics.
We show how this model can be incorporated into standard policy reuse frameworks.
arXiv Detail & Related papers (2020-02-29T07:58:36Z)
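For the "Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning" entry above, combining a diffusion policy with Q-learning is commonly written as a two-term objective. The following is a hedged reconstruction from the summary, not a quote from that paper; the trade-off weight $\eta$, the diffusion policy $\pi_\theta$, and the critic $Q_\phi$ are notation introduced here:

$$\min_\theta \; \mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{denoise}}(\theta) \;-\; \eta \, \mathbb{E}_{s \sim \mathcal{D},\; a^{0} \sim \pi_\theta(\cdot \mid s)}\!\left[ Q_\phi(s, a^{0}) \right],$$

where $\mathcal{L}_{\text{denoise}}$ is the standard behavior-cloning denoising loss on dataset actions and $a^{0}$ is an action generated by the policy's reverse diffusion chain, so that maximizing $Q_\phi$ steers the cloned behavior toward higher-value actions.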
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.