Inference-Time Policy Steering through Human Interactions
- URL: http://arxiv.org/abs/2411.16627v1
- Date: Mon, 25 Nov 2024 18:03:50 GMT
- Title: Inference-Time Policy Steering through Human Interactions
- Authors: Yanwei Wang, Lirui Wang, Yilun Du, Balakumar Sundaralingam, Xuning Yang, Yu-Wei Chao, Claudia Perez-D'Arpino, Dieter Fox, Julie Shah
- Abstract summary: During inference, humans are often removed from the policy execution loop.
We propose an Inference-Time Policy Steering framework that leverages human interactions to bias the generative sampling process.
Our proposed sampling strategy achieves the best trade-off between alignment and distribution shift.
- Score: 54.02655062969934
- Abstract: Generative policies trained with human demonstrations can autonomously accomplish multimodal, long-horizon tasks. However, during inference, humans are often removed from the policy execution loop, limiting the ability to guide a pre-trained policy towards a specific sub-goal or trajectory shape among multiple predictions. Naive human intervention may inadvertently exacerbate distribution shift, leading to constraint violations or execution failures. To better align policy output with human intent without inducing out-of-distribution errors, we propose an Inference-Time Policy Steering (ITPS) framework that leverages human interactions to bias the generative sampling process, rather than fine-tuning the policy on interaction data. We evaluate ITPS across three simulated and real-world benchmarks, testing three forms of human interaction and associated alignment distance metrics. Among six sampling strategies, our proposed stochastic sampling with diffusion policy achieves the best trade-off between alignment and distribution shift. Videos are available at https://yanweiw.github.io/itps/.
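To make the mechanism concrete, here is a minimal sketch of sampling-time steering: a frozen diffusion policy's reverse process is nudged by the gradient of an alignment cost derived from a human input, with no fine-tuning. The point-click interaction, the attraction cost, the `denoise_step` interface, and all constants are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def alignment_grad(traj, user_point):
    """Gradient of a simple point-attraction cost: the waypoint closest to
    a point the human clicked gets pulled toward it. This cost is a
    hypothetical stand-in for the paper's alignment distance metrics."""
    dists = np.linalg.norm(traj - user_point, axis=-1)
    i = np.argmin(dists)
    grad = np.zeros_like(traj)
    diff = traj[i] - user_point
    grad[i] = diff / (np.linalg.norm(diff) + 1e-8)
    return grad

def steered_sample(denoise_step, user_point, horizon=16, dim=2,
                   n_steps=50, guidance=0.5, noise=0.05):
    """Bias a frozen diffusion policy's reverse process toward human intent.
    `denoise_step(traj, t)` is an assumed interface to the pre-trained
    model; only the sampling loop changes, never the model weights."""
    traj = np.random.randn(horizon, dim)               # start from noise
    for t in reversed(range(n_steps)):
        traj = denoise_step(traj, t)                   # model's denoising update
        traj = traj - guidance * alignment_grad(traj, user_point)
        traj = traj + noise * np.random.randn(horizon, dim)  # stay stochastic
    return traj
```

Re-injecting noise after each guided step keeps the sampling stochastic, which the abstract credits with the best alignment/distribution-shift trade-off.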
Related papers
- IntervenGen: Interventional Data Generation for Robust and Data-Efficient Robot Imitation Learning [43.19346528232497]
A popular approach for increasing policy robustness to distribution shift is interactive imitation learning.
We propose IntervenGen, a novel data generation system that can autonomously produce a large set of corrective interventions.
We show that it can increase policy robustness by up to 39x with only 10 human interventions.
arXiv Detail & Related papers (2024-05-02T17:06:19Z)
- Policy-Guided Diffusion [30.4597043728046]
In many real-world settings, agents must learn from an offline dataset gathered by some prior behavior policy.
We propose policy-guided diffusion as an alternative to autoregressive offline world models.
We show that policy-guided diffusion models a regularized form of the target distribution that balances action likelihood under both the target and behavior policies.
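A minimal sketch of one guided denoising step, assuming generic interfaces for the trajectory diffusion model and the target policy (neither is the paper's code):

```python
import torch

def policy_guided_step(behavior_model, target_policy, traj_t, t,
                       state_dim, guidance=0.1):
    """One reverse-diffusion step with policy guidance, sketched from the
    abstract. `behavior_model(traj, t)` denoises a trajectory toward the
    offline (behavior-policy) distribution; the gradient of the target
    policy's action log-probability tilts the sample toward actions the
    target policy prefers, blending the two distributions."""
    traj_t = traj_t.detach().requires_grad_(True)
    states = traj_t[..., :state_dim]
    actions = traj_t[..., state_dim:]
    log_prob = target_policy.log_prob(actions, states).sum()
    guidance_grad = torch.autograd.grad(log_prob, traj_t)[0]
    with torch.no_grad():
        denoised = behavior_model(traj_t, t)  # behavior prior keeps samples in-distribution
    return denoised + guidance * guidance_grad
```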
arXiv Detail & Related papers (2024-04-09T14:46:48Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
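One simplified reading of "convolving" the policies is kernel smoothing over an action-embedding space; the sketch below follows that reading with an illustrative Gaussian kernel and should not be taken as the paper's estimator.

```python
import numpy as np

def convolve_policy(pi, action_emb, bandwidth=1.0):
    """Smooth per-context action probabilities over neighboring actions in
    an embedding space. pi: [n_contexts, n_actions]; action_emb:
    [n_actions, d]. The Gaussian kernel is an illustrative assumption."""
    sq_dist = np.linalg.norm(action_emb[:, None] - action_emb[None, :], axis=-1) ** 2
    kernel = np.exp(-sq_dist / (2 * bandwidth ** 2))
    kernel /= kernel.sum(axis=1, keepdims=True)  # rows are smoothing distributions
    return pi @ kernel                           # rows remain valid distributions

def convolved_ips(rewards, logged_actions, pi_log, pi_tgt, action_emb):
    """Inverse-propensity estimate with both policies smoothed, trading a
    little bias for lower variance when actions are many and sparsely logged."""
    pl = convolve_policy(pi_log, action_emb)
    pt = convolve_policy(pi_tgt, action_emb)
    idx = np.arange(len(rewards))
    weights = pt[idx, logged_actions] / pl[idx, logged_actions]
    return float(np.mean(weights * rewards))
```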
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism [91.52263068880484]
We study offline Reinforcement Learning with Human Feedback (RLHF).
We aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices.
RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift.
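As a toy version of the reward-learning step, the sketch below fits a linear reward to pairwise trajectory choices under a Boltzmann-style (bounded-rationality) choice model; the paper's dynamic-choice formulation and pessimism penalty are not reproduced here.

```python
import numpy as np
from scipy.optimize import minimize

def fit_reward_from_choices(feats_a, feats_b, chose_a, l2=1e-3):
    """Maximum-likelihood linear reward from pairwise trajectory choices.
    P(choose a) = sigmoid(theta @ (phi(a) - phi(b))), a Boltzmann-style
    model of boundedly rational human decisions (illustrative only)."""
    def nll(theta):
        margin = (feats_a - feats_b) @ theta
        # log-likelihood of each observed choice, numerically stable
        ll = chose_a * margin - np.logaddexp(0.0, margin)
        return -ll.sum() + l2 * theta @ theta
    return minimize(nll, np.zeros(feats_a.shape[1])).x
```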
arXiv Detail & Related papers (2023-05-29T01:18:39Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims to identify and evaluate efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
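A generic weighted-conformal construction conveys the flavor: importance weights reweight calibration returns, and weighted quantiles give an interval with the prescribed coverage. This is illustrative, not the paper's exact procedure.

```python
import numpy as np

def conformal_value_interval(returns, weights, alpha=0.1):
    """Interval for the target policy's reward from logged returns.
    `weights` are target/behavior likelihood ratios for each calibration
    trajectory; weighted empirical quantiles give an interval meant to
    cover the true reward with probability >= 1 - alpha."""
    order = np.argsort(returns)
    r, w = returns[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)               # weighted empirical CDF
    lo = r[np.searchsorted(cdf, alpha / 2)]
    hi = r[np.searchsorted(cdf, 1 - alpha / 2)]
    return lo, hi
```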
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Learning Latent Traits for Simulated Cooperative Driving Tasks [10.009803620912777]
We build a framework capable of capturing a compact latent representation of the human in terms of their behavior and preferences.
We then build a lightweight simulation environment, HMIway-env, for modelling one form of distracted driving behavior.
We finally use this environment to quantify both the ability to discriminate drivers and the effectiveness of intervention policies.
arXiv Detail & Related papers (2022-07-20T02:27:18Z)
- The Boltzmann Policy Distribution: Accounting for Systematic Suboptimality in Human Models [5.736353542430439]
We introduce the Boltzmann policy distribution (BPD), which serves as a prior over human policies.
BPD adapts via Bayesian inference to capture systematic deviations by observing human actions during a single episode.
We show that the BPD enables prediction of human behavior and human-AI collaboration as well as imitation-learning-based human models do.
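To make the human model concrete: the sketch below uses a discrete grid over rationality levels as a stand-in for the BPD's full distribution over policies, with a Bayesian update from observed actions.

```python
import numpy as np

def boltzmann_policy(q_values, beta):
    """Boltzmann-rational action distribution: a systematically suboptimal
    human picks actions with probability proportional to exp(beta * Q)."""
    logits = beta * q_values
    p = np.exp(logits - logits.max())
    return p / p.sum()

def update_rationality_posterior(q_seq, action_seq, betas, prior):
    """Bayesian adaptation within a single episode: maintain a posterior
    over candidate rationality levels and update it from observed actions."""
    post = np.asarray(prior, dtype=float).copy()
    for q, a in zip(q_seq, action_seq):
        likelihood = np.array([boltzmann_policy(q, b)[a] for b in betas])
        post *= likelihood
        post /= post.sum()
    return post
```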
arXiv Detail & Related papers (2022-04-22T15:26:25Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method outperforms the next best-performing offline reinforcement learning methods by 49% on average on heterogeneous datasets.
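A condensed sketch of the kind of objective this describes, with assumed `encoder` and `policy` interfaces: an inferred latent selects the behavior mode, and higher-advantage transitions get exponentially larger imitation weight.

```python
import torch

def latent_awr_loss(policy, encoder, states, actions, advantages, temp=1.0):
    """Latent-variable advantage-weighted regression, sketched: an encoder
    infers a latent mode z per transition so one policy can cover
    heterogeneous behaviors. Interfaces are assumptions, not the paper's."""
    z = encoder(states, actions)                    # infer behavior mode
    log_prob = policy.log_prob(actions, states, z)  # mode-conditioned likelihood
    weights = torch.exp(advantages.detach() / temp).clamp(max=20.0)  # cap to avoid blow-up
    return -(weights * log_prob).mean()
```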
arXiv Detail & Related papers (2022-03-16T21:17:03Z)
- Human-in-the-Loop Imitation Learning using Remote Teleoperation [72.2847988686463]
We build a data collection system tailored to 6-DoF manipulation settings.
We develop an algorithm to train the policy iteratively on new data collected by the system.
We demonstrate that agents trained on data collected by our intervention-based system and algorithm outperform agents trained on an equivalent number of samples collected by non-interventional demonstrators.
arXiv Detail & Related papers (2020-12-12T05:30:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.