Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment
- URL: http://arxiv.org/abs/2412.04835v1
- Date: Fri, 06 Dec 2024 08:04:02 GMT
- Title: Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment
- Authors: Ran Tian, Yilin Wu, Chenfeng Xu, Masayoshi Tomizuka, Jitendra Malik, Andrea Bajcsy
- Abstract summary: We propose Representation-Aligned Preference-based Learning (RAPL) to learn visual rewards from significantly less human preference feedback.
RAPL focuses on fine-tuning pre-trained vision encoders to align with the end-user's visual representation and then constructs a dense visual reward via feature matching.
We show that RAPL learns rewards aligned with human preferences, uses preference data more efficiently, and generalizes across robot embodiments.
- Score: 73.14105098897696
- Abstract: Visuomotor robot policies, increasingly pre-trained on large-scale datasets, promise significant advancements across robotics domains. However, aligning these policies with end-user preferences remains a challenge, particularly when the preferences are hard to specify. While reinforcement learning from human feedback (RLHF) has become the predominant mechanism for alignment in non-embodied domains like large language models, it has not seen the same success in aligning visuomotor policies due to the prohibitive amount of human feedback required to learn visual reward functions. To address this limitation, we propose Representation-Aligned Preference-based Learning (RAPL), an observation-only method for learning visual rewards from significantly less human preference feedback. Unlike traditional RLHF, RAPL focuses human feedback on fine-tuning pre-trained vision encoders to align with the end-user's visual representation and then constructs a dense visual reward via feature matching in this aligned representation space. We first validate RAPL through simulation experiments in the X-Magical benchmark and Franka Panda robotic manipulation, demonstrating that it can learn rewards aligned with human preferences, more efficiently uses preference data, and generalizes across robot embodiments. Finally, our hardware experiments align pre-trained Diffusion Policies for three object manipulation tasks. We find that RAPL can fine-tune these policies with 5x less real human preference data, taking the first step towards minimizing human feedback while maximizing visuomotor robot policy alignment.
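The abstract describes building a dense visual reward by feature matching in the aligned representation space. The following is a minimal sketch of that idea, not the authors' implementation: it assumes each frame has already been embedded by some aligned encoder (the embeddings below are made-up 2-D vectors), and it uses a nearest-neighbor Euclidean distance as a stand-in for whatever matching procedure the paper actually uses. The function name `feature_matching_reward` is hypothetical.

```python
import math

def feature_matching_reward(rollout_feats, preferred_feats):
    """Return one reward per rollout frame.

    Assumption: the reward for a frame is the negative Euclidean distance
    from its embedding to the closest embedding in a preferred trajectory.
    This is an illustrative choice, not the matching used in the paper.
    """
    rewards = []
    for feat in rollout_feats:
        nearest = min(math.dist(feat, ref) for ref in preferred_feats)
        rewards.append(-nearest)
    return rewards

# Hypothetical embeddings; a real encoder would output higher-dimensional features.
preferred = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
rollout = [[0.0, 0.0], [1.0, 1.0], [2.0, 3.0]]

rewards = feature_matching_reward(rollout, preferred)
print(rewards)
```

Because every frame receives its own value, the signal is dense: frames whose features drift away from the preferred trajectory are penalized immediately rather than only at the end of the episode.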
Related papers
- PILAF: Optimal Human Preference Sampling for Reward Modeling [14.336058926701432]
We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling.
PILAF explicitly aligns preference learning with maximizing the underlying oracle reward.
arXiv Detail & Related papers (2025-02-06T18:09:00Z)
- FDPP: Fine-tune Diffusion Policy with Human Preference [57.44575105114056]
Fine-tuning Diffusion Policy with Human Preference learns a reward function through preference-based learning.
This reward is then used to fine-tune the pre-trained policy with reinforcement learning.
Experiments demonstrate that FDPP effectively customizes policy behavior without compromising performance.
arXiv Detail & Related papers (2025-01-14T17:15:27Z)
- Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning [12.742158403867002]
Reinforcement Learning from Human Feedback is a powerful paradigm for aligning foundation models to human values and preferences.
Current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population.
We develop a class of multimodal RLHF methods to address the need for pluralistic alignment.
arXiv Detail & Related papers (2024-08-19T15:18:30Z)
- Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
arXiv Detail & Related papers (2024-06-17T03:51:46Z)
- REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
- What Matters to You? Towards Visual Representation Alignment for Robot Learning [81.30964736676103]
When operating in service of people, robots need to optimize rewards aligned with end-user preferences.
We propose Representation-Aligned Preference-based Learning (RAPL), a method for solving the visual representation alignment problem.
arXiv Detail & Related papers (2023-10-11T23:04:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.