MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention
- URL: http://arxiv.org/abs/2406.16258v2
- Date: Mon, 28 Oct 2024 19:17:41 GMT
- Title: MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention
- Authors: Yuxin Chen, Chen Tang, Chenran Li, Ran Tian, Wei Zhan, Peter Stone, Masayoshi Tomizuka,
- Abstract summary: We introduce MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), designed for sample-efficient alignment from human intervention.
MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions.
It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function.
- Score: 81.56607128684723
- Abstract: Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy's execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thus hindering sample efficiency. In this work, we introduce MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function. Extensive evaluations on simulated and real-world tasks demonstrate that MEReQ achieves sample-efficient policy alignment from human intervention.
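The abstract outlines the mechanism at a high level: a residual reward is inferred from human interventions via maximum-entropy IRL, and Residual Q-Learning then lets the agent learn only a residual soft-Q function on top of the frozen prior policy. Below is a minimal sketch of that combination step for a discrete action space; the tensor names, the use of the prior policy's soft-Q values, and the soft Bellman target are illustrative assumptions, not the authors' implementation.

```python
import torch

def aligned_policy_logits(q_prior, q_residual, alpha=1.0):
    """Boltzmann policy over combined soft-Q values.

    q_prior:    [batch, n_actions] soft-Q values (or alpha * log-probs) of the
                frozen prior policy
    q_residual: [batch, n_actions] residual soft-Q values trained on the
                residual reward inferred from human interventions
    alpha:      entropy temperature

    Sampling from softmax(logits) reuses the prior behavior and only corrects
    it where the residual reward disagrees with the prior.
    """
    return (q_prior + q_residual) / alpha


def residual_soft_bellman_target(r_residual, q_prior_next, q_residual_next,
                                 alpha=1.0, gamma=0.99):
    """One-step soft Bellman target for the residual Q-network.

    r_residual: [batch] residual reward for the transition, e.g. the output of
                a MaxEnt-IRL reward model fit to intervention data
    The next-state value uses the combined Q-values:
        V(s') = alpha * logsumexp((Q_prior + Q_res) / alpha)
    """
    v_next = alpha * torch.logsumexp(
        (q_prior_next + q_residual_next) / alpha, dim=-1)
    return r_residual + gamma * v_next
```

In an intervention loop, the residual reward model would be refit from the segments where the human took over, and the residual Q-network regressed toward this target while the prior policy stays fixed.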
Related papers
- FDPP: Fine-tune Diffusion Policy with Human Preference [57.44575105114056]
FDPP (Fine-tuning Diffusion Policy with Human Preference) learns a reward function through preference-based learning.
This reward is then used to fine-tune the pre-trained policy with reinforcement learning.
Experiments demonstrate that FDPP effectively customizes policy behavior without compromising performance.
arXiv Detail & Related papers (2025-01-14T17:15:27Z)
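The FDPP entry above follows a two-stage recipe: learn a reward model from human preference labels, then fine-tune the pre-trained diffusion policy with RL against that reward. A common instantiation of the first stage is a Bradley-Terry pairwise loss; the sketch below illustrates only that generic step (the RewardNet architecture and tensor shapes are assumptions, not FDPP's actual code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardNet(nn.Module):
    """Small MLP reward model over concatenated (state, action) features."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def bradley_terry_loss(reward_net, seg_preferred, seg_rejected):
    """Pairwise preference loss: the preferred segment should receive a higher
    cumulative predicted reward than the rejected one.

    Each segment is an (obs, act) tuple of tensors shaped [segment_len, dim];
    returns a scalar loss to minimize.
    """
    r_pref = reward_net(*seg_preferred).sum()
    r_rej = reward_net(*seg_rejected).sum()
    return -F.logsigmoid(r_pref - r_rej)
```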
- Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment [73.14105098897696]
We propose Representation-Aligned Preference-based Learning (RAPL) to learn visual rewards from significantly less human preference feedback.
RAPL focuses on fine-tuning pre-trained vision encoders to align with the end-user's visual representation and then constructs a dense visual reward via feature matching.
We show that RAPL can learn rewards aligned with human preferences, more efficiently uses preference data, and generalizes across robot embodiments.
arXiv Detail & Related papers (2024-12-06T08:04:02Z)
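The RAPL entry above turns a preference-aligned vision encoder into a dense reward via feature matching. A minimal way to realize a feature-matching reward is negative distance in the encoder's feature space; the sketch below is only an illustration of that idea, and the encoder phi, the reference-frame setup, and the distance metric are assumptions rather than RAPL's exact construction.

```python
import torch

def feature_matching_reward(phi, obs_img, ref_img):
    """Dense reward as negative distance between encoder features of the
    current observation and a reference frame the behavior should match.

    phi:      vision encoder (e.g. a preference-aligned, fine-tuned backbone)
              mapping image batches to feature vectors
    obs_img:  [batch, C, H, W] current camera observation
    ref_img:  [batch, C, H, W] reference frame, e.g. from a preferred rollout
    """
    with torch.no_grad():
        f_obs = phi(obs_img)
        f_ref = phi(ref_img)
    return -torch.norm(f_obs - f_ref, dim=-1)
```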
- Offline Risk-sensitive RL with Partial Observability to Enhance Performance in Human-Robot Teaming [1.3980986259786223]
We propose a method to incorporate model uncertainty, thus enabling risk-sensitive sequential decision-making.
Experiments were conducted with a group of twenty-six human participants within a simulated robot teleoperation environment.
arXiv Detail & Related papers (2024-02-08T14:27:34Z)
- REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization [16.335645061396455]
In Reinforcement Learning (RL), agents have no incentive to exhibit predictable behaviors.
We propose a novel method to induce predictable behavior in RL agents, termed Predictability-Aware RL (PARL).
Our method maximizes a linear combination of a standard discounted reward and the negative entropy rate, thus trading off optimality with predictability.
arXiv Detail & Related papers (2023-11-30T16:53:32Z)
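The PARL entry above states its objective directly: maximize a weighted combination of the discounted task reward and the negative entropy rate of the induced process. One per-step realization of this trade-off augments the environment reward with a predictability bonus; in the sketch below, the log-likelihood of the observed next state under a learned transition model serves as a local surrogate for the negative entropy rate, and both this surrogate and the weight beta are illustrative assumptions rather than the paper's exact estimator.

```python
import torch

def predictability_augmented_reward(r_env, log_prob_next_state, beta=0.1):
    """Per-step reward trading off task return against predictability.

    r_env:               [batch] environment reward
    log_prob_next_state: [batch] log-likelihood of the observed next state
                         under a learned transition model; its expectation
                         relates to the negative entropy rate of the process
    beta:                weight on the predictability term
    """
    return r_env + beta * log_prob_next_state
```

An agent maximizing this augmented reward is paid for visiting transitions that the model can predict, trading some task reward for predictability.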
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
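The CPL entry above learns a policy directly from preference pairs with a contrastive objective and no intermediate reward model. In spirit, each segment is scored by the discounted log-likelihood the policy assigns to its actions, and the preferred segment should score higher; the sketch below is a simplified version of such a loss and omits details such as CPL's conservative weighting of the rejected segment.

```python
import torch
import torch.nn.functional as F

def segment_score(log_probs, gamma=0.99, alpha=0.1):
    """Discounted sum of log pi(a_t | s_t) along one segment, scaled by alpha."""
    t = torch.arange(log_probs.shape[0], dtype=log_probs.dtype)
    return alpha * ((gamma ** t) * log_probs).sum()

def contrastive_preference_loss(log_probs_pref, log_probs_rej,
                                gamma=0.99, alpha=0.1):
    """Push the policy to assign a higher discounted log-likelihood to the
    preferred segment than to the rejected one (logistic pairwise loss)."""
    score_pref = segment_score(log_probs_pref, gamma, alpha)
    score_rej = segment_score(log_probs_rej, gamma, alpha)
    return -F.logsigmoid(score_pref - score_rej)
```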
- Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism [91.52263068880484]
We study offline Reinforcement Learning with Human Feedback (RLHF).
We aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices.
RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift.
arXiv Detail & Related papers (2023-05-29T01:18:39Z)
- Learning Human Rewards by Inferring Their Latent Intelligence Levels in Multi-Agent Games: A Theory-of-Mind Approach with Application to Driving Data [18.750834997334664]
We argue that humans are boundedly rational and have different intelligence levels when reasoning about others' decision-making processes.
We propose a new multi-agent Inverse Reinforcement Learning framework that reasons about humans' latent intelligence levels during learning.
arXiv Detail & Related papers (2021-03-07T07:48:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.