FDPP: Fine-tune Diffusion Policy with Human Preference
- URL: http://arxiv.org/abs/2501.08259v1
- Date: Tue, 14 Jan 2025 17:15:27 GMT
- Title: FDPP: Fine-tune Diffusion Policy with Human Preference
- Authors: Yuxin Chen, Devesh K. Jha, Masayoshi Tomizuka, Diego Romeres
- Abstract summary: Fine-tuning Diffusion Policy with Human Preference learns a reward function through preference-based learning.
This reward is then used to fine-tune the pre-trained policy with reinforcement learning.
Experiments demonstrate that FDPP effectively customizes policy behavior without compromising performance.
- Score: 57.44575105114056
- Abstract: Imitation learning from human demonstrations enables robots to perform complex manipulation tasks and has recently witnessed huge success. However, these techniques often struggle to adapt behavior to new preferences or changes in the environment. To address these limitations, we propose Fine-tuning Diffusion Policy with Human Preference (FDPP). FDPP learns a reward function through preference-based learning. This reward is then used to fine-tune the pre-trained policy with reinforcement learning (RL), aligning the pre-trained policy with new human preferences while still solving the original task. Our experiments across various robotic tasks and preferences demonstrate that FDPP effectively customizes policy behavior without compromising performance. Additionally, we show that incorporating Kullback-Leibler (KL) regularization during fine-tuning prevents overfitting and helps maintain the competencies of the initial policy.
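The abstract names two ingredients: a reward learned from pairwise human preferences and KL-regularized RL fine-tuning. The sketch below illustrates both under explicit assumptions: the Bradley-Terry logistic loss is a common choice for preference-based reward learning (the abstract does not specify the preference model), the per-sample log-ratio is one standard KL estimate, and all names (RewardNet, policy.log_prob, beta) are hypothetical rather than the authors' released code.

```python
# Hedged sketch of the two ingredients described in the abstract:
# (1) a reward model trained from pairwise preferences, and
# (2) a KL-regularized fine-tuning objective against the frozen pre-trained policy.
# Interfaces (RewardNet, policy.log_prob) are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardNet(nn.Module):
    """Small MLP reward model r(s) over flattened observations."""

    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


def preference_loss(reward_net: RewardNet,
                    seg_preferred: torch.Tensor,
                    seg_other: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss for one labeled pair of trajectory segments.

    seg_*: (T, obs_dim) tensors; the human preferred `seg_preferred`.
    """
    r_pref = reward_net(seg_preferred).sum()   # return of the preferred segment
    r_other = reward_net(seg_other).sum()      # return of the other segment
    # P(preferred wins) = sigmoid(r_pref - r_other); minimize its negative log-likelihood.
    return -F.logsigmoid(r_pref - r_other)


def kl_regularized_objective(reward_net, policy, ref_policy,
                             obs, actions, beta: float = 0.1) -> torch.Tensor:
    """Fine-tuning objective (to maximize): learned reward minus a KL-style penalty
    that keeps the fine-tuned policy close to the frozen pre-trained one.
    The per-sample log-ratio is one common KL estimate, not necessarily FDPP's exact form.
    """
    rewards = reward_net(obs)
    log_ratio = policy.log_prob(obs, actions) - ref_policy.log_prob(obs, actions)
    return (rewards - beta * log_ratio).mean()
```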
Related papers
- Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment [73.14105098897696]
We propose Representation-Aligned Preference-based Learning (RAPL) to learn visual rewards from significantly less human preference feedback.
RAPL focuses on fine-tuning pre-trained vision encoders to align with the end-user's visual representation and then constructs a dense visual reward via feature matching.
We show that RAPL learns rewards aligned with human preferences, uses preference data more efficiently, and generalizes across robot embodiments (see the sketch after this entry).
arXiv Detail & Related papers (2024-12-06T08:04:02Z)
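The RAPL entry above builds a dense visual reward by feature matching in a preference-aligned representation space. The snippet below is only a generic feature-matching reward along those lines, not RAPL's implementation; the encoder, the reference observation, and the cosine metric are all assumptions.

```python
# Generic feature-matching reward sketch (not RAPL's released code).
# Assumes `encoder` is a preference-aligned vision encoder mapping images to vectors.
import torch
import torch.nn.functional as F


def feature_matching_reward(encoder, obs_image: torch.Tensor,
                            reference_image: torch.Tensor) -> float:
    """Dense reward: similarity of the current observation's features to those of a
    reference (e.g. preferred or goal) observation in the aligned feature space."""
    with torch.no_grad():
        z_obs = encoder(obs_image.unsqueeze(0)).flatten(1)   # (1, d)
        z_ref = encoder(reference_image.unsqueeze(0)).flatten(1)
    return F.cosine_similarity(z_obs, z_ref, dim=-1).item()
```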
- Personalisation via Dynamic Policy Fusion [14.948610521764415]
Deep reinforcement learning policies may not align with the personal preferences of human users.
We propose a more practical approach: adapting the already-trained policy to user-specific needs with the help of human feedback.
We empirically demonstrate in a number of environments that our proposed dynamic policy fusion approach consistently achieves the intended task.
arXiv Detail & Related papers (2024-09-30T07:23:47Z)
- Forward KL Regularized Preference Optimization for Aligning Diffusion Policies [8.958830452149789]
A central problem for learning diffusion policies is to align the policy output with human intents in various tasks.
We propose a novel framework, Forward KL Regularized Preference Optimization, to align the diffusion policy with preferences directly (a sketch of the regularizer follows this entry).
The results show that our method exhibits superior alignment with preferences and outperforms previous state-of-the-art algorithms.
arXiv Detail & Related papers (2024-09-09T13:56:03Z)
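The entry above regularizes preference optimization with a forward KL term toward the pre-trained policy. The snippet below sketches one common Monte-Carlo estimate of a forward KL regularizer, D_KL(pi_ref || pi_theta), using samples drawn from the reference policy; the cited paper's exact formulation for diffusion policies may differ, and all interfaces are assumptions.

```python
# Sketch of a forward KL regularizer, D_KL(pi_ref || pi_theta), estimated with
# samples drawn from the reference (pre-trained) policy. Interfaces are assumptions;
# the cited paper's exact formulation for diffusion policies may differ.
import torch


def forward_kl_regularizer(policy, ref_policy, obs: torch.Tensor,
                           n_samples: int = 8) -> torch.Tensor:
    """Monte-Carlo estimate of E_{a ~ pi_ref}[log pi_ref(a|s) - log pi_theta(a|s)]."""
    kl_terms = []
    for _ in range(n_samples):
        with torch.no_grad():
            actions = ref_policy.sample(obs)            # assumed sampling interface
            logp_ref = ref_policy.log_prob(obs, actions)
        logp = policy.log_prob(obs, actions)            # gradient flows into pi_theta
        kl_terms.append(logp_ref - logp)
    return torch.stack(kl_terms).mean()
```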
- Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion [43.77763433288893]
We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data.
We show that this approach generalizes the direct alignment method IPO (identity preference optimization) as well as the classic policy gradient.
We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for finetuning LLMs on a summarization task.
arXiv Detail & Related papers (2024-06-27T14:03:49Z)
- WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP).
WARP merges policies in the weight space at three distinct stages (a minimal weight-averaging sketch follows this entry).
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
arXiv Detail & Related papers (2024-06-24T16:24:34Z)
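The WARP entry above merges policies in the weight space. The snippet below only sketches the basic building block, linear averaging of model state dicts; WARP's actual three-stage procedure is more elaborate, and the helper name is illustrative.

```python
# Minimal weight-space merging sketch. WARP's three-stage procedure is more
# elaborate than plain averaging; this only illustrates the basic operation.
import copy


def average_state_dicts(state_dicts, weights=None):
    """Linearly combine a list of model state_dicts (optionally weighted)."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = copy.deepcopy(state_dicts[0])
    for key in merged:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged


# Usage (hypothetical): merge policies fine-tuned with different seeds or rewards.
# merged = average_state_dicts([policy_a.state_dict(), policy_b.state_dict()])
# policy.load_state_dict(merged)
```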
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs (a sketch of such an objective follows this entry).
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
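In contrast to the reward-model route sketched under the FDPP abstract above, the CPL entry scores the policy itself with a contrastive objective over preferred and non-preferred segments. The snippet below sketches a loss of that flavor; the scale factor, the lack of discounting, and the policy interface are assumptions rather than CPL's exact objective.

```python
# Sketch of a contrastive preference objective in the spirit of CPL: the policy's own
# log-likelihoods on two segments are compared directly, with no separate reward model.
# The scale `alpha` and the policy interface are assumptions, not CPL's exact form.
import torch
import torch.nn.functional as F


def contrastive_preference_loss(policy, seg_pref, seg_other,
                                alpha: float = 0.1) -> torch.Tensor:
    """seg_* are (obs, actions) tensor pairs; the human preferred `seg_pref`."""
    obs_p, act_p = seg_pref
    obs_o, act_o = seg_other
    score_pref = alpha * policy.log_prob(obs_p, act_p).sum()
    score_other = alpha * policy.log_prob(obs_o, act_o).sum()
    # Logistic loss pushes the preferred segment's likelihood above the other's.
    return -F.logsigmoid(score_pref - score_other)
```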
- Coherent Soft Imitation Learning [17.345411907902932]
Imitation learning methods seek to learn from an expert either through behavioral cloning (BC) of the policy or inverse reinforcement learning (IRL) of the reward.
This work derives an imitation method that captures the strengths of both BC and IRL.
arXiv Detail & Related papers (2023-05-25T21:54:22Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset (see the sketch after this entry).
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
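The IQL entry above avoids evaluating actions outside the dataset. The snippet below sketches the expectile-regression value loss at the core of Implicit Q-Learning: fitting V(s) to an upper expectile of Q(s, a) over dataset actions approximates a maximum without ever querying unseen actions. Network names and the expectile value are illustrative.

```python
# Sketch of the expectile-regression value loss behind Implicit Q-Learning.
# Only actions that appear in the dataset are evaluated, so no out-of-distribution
# action queries are needed. Names and hyperparameters are illustrative.
import torch


def expectile_value_loss(q_net, v_net, obs: torch.Tensor, actions: torch.Tensor,
                         tau: float = 0.7) -> torch.Tensor:
    """Fit V(s) toward an upper expectile (tau > 0.5) of Q(s, a) on dataset actions."""
    with torch.no_grad():
        q = q_net(obs, actions)                     # Q-values of in-dataset actions
    diff = q - v_net(obs)                           # positive where Q exceeds current V
    weight = torch.abs(tau - (diff < 0).float())    # tau if diff >= 0, else 1 - tau
    return (weight * diff.pow(2)).mean()
```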
- First Order Constrained Optimization in Policy Space [19.00289722198614]
We propose a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS).
FOCOPS maximizes an agent's overall reward while ensuring the agent satisfies a set of cost constraints.
We provide empirical evidence that our simple approach achieves better performance on a set of constrained robotic locomotion tasks.
arXiv Detail & Related papers (2020-02-16T05:07:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.