The Boltzmann Policy Distribution: Accounting for Systematic
Suboptimality in Human Models
- URL: http://arxiv.org/abs/2204.10759v1
- Date: Fri, 22 Apr 2022 15:26:25 GMT
- Title: The Boltzmann Policy Distribution: Accounting for Systematic
Suboptimality in Human Models
- Authors: Cassidy Laidlaw and Anca Dragan
- Abstract summary: We introduce the Boltzmann policy distribution (BPD), which serves as a prior over human policies.
BPD adapts via Bayesian inference to capture systematic deviations by observing human actions during a single episode.
We show that the BPD predicts human behavior and supports human-AI collaboration as effectively as imitation learning-based human models.
- Score: 5.736353542430439
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Models of human behavior for prediction and collaboration tend to fall into
two categories: ones that learn from large amounts of data via imitation
learning, and ones that assume human behavior to be noisily-optimal for some
reward function. The former are very useful, but only when it is possible to
gather a lot of human data in the target environment and distribution. The
advantage of the latter type, which includes Boltzmann rationality, is the
ability to make accurate predictions in new environments without extensive data
when humans are actually close to optimal. However, these models fail when
humans exhibit systematic suboptimality, i.e., when their deviations from
optimal behavior are not independent but instead consistent over time. Our key
insight is that systematic suboptimality can be modeled by predicting policies,
which couple action choices over time, instead of trajectories. We introduce
the Boltzmann policy distribution (BPD), which serves as a prior over human
policies and adapts via Bayesian inference to capture systematic deviations by
observing human actions during a single episode. The BPD is difficult to
compute and represent because policies lie in a high-dimensional continuous
space, but we leverage tools from generative and sequence models to enable
efficient sampling and inference. We show that the BPD predicts human
behavior and supports human-AI collaboration as effectively as imitation
learning-based human models while using far less data.
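To make the contrast concrete, here is a minimal sketch in LaTeX of the standard Boltzmann-rational action model versus a Boltzmann-style distribution over whole policies, with the within-episode Bayesian update. The notation ($\beta$ as a rationality coefficient, $Q^*$ as the optimal state-action value, $J(\pi)$ as the expected return of policy $\pi$) is standard but assumed here; it may not match the paper's exact formulation.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}

% Boltzmann rationality: noise is independent at every timestep.
\[
  p(a_t \mid s_t) \;\propto\; \exp\bigl(\beta\, Q^*(s_t, a_t)\bigr)
\]

% Boltzmann policy distribution: a prior over entire policies, so that
% deviations from optimality are coupled (systematic) across timesteps.
\[
  p(\pi) \;\propto\; \exp\bigl(\beta\, J(\pi)\bigr),
  \qquad
  J(\pi) = \mathbb{E}_{\pi}\!\left[\textstyle\sum_t r(s_t, a_t)\right]
\]

% Bayesian adaptation within an episode: reweight policies by how well
% they explain the actions observed so far ...
\[
  p(\pi \mid s_{1:t}, a_{1:t}) \;\propto\; p(\pi) \prod_{k=1}^{t} \pi(a_k \mid s_k)
\]

% ... and predict the next action with the posterior-predictive mixture.
\[
  p(a_{t+1} \mid s_{t+1}, s_{1:t}, a_{1:t})
  = \int \pi(a_{t+1} \mid s_{t+1})\, p(\pi \mid s_{1:t}, a_{1:t})\, d\pi
\]

\end{document}
```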
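As a complementary sketch, the hypothetical Python snippet below (not the authors' implementation, which uses generative and sequence models to handle the high-dimensional policy space) illustrates the inference pattern with a crude particle approximation: sample candidate tabular policies, weight them by a Boltzmann prior, and reweight by likelihood as human actions arrive. All names, sizes, and the placeholder returns are invented for illustration.

```python
import numpy as np

# Hypothetical illustration (not the paper's code): a particle approximation
# of the Boltzmann policy distribution on a tiny discrete MDP.
rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, BETA, K = 5, 3, 2.0, 1000

def sample_policies(k):
    """Sample k stochastic tabular policies from a uniform proposal."""
    logits = rng.normal(size=(k, N_STATES, N_ACTIONS))
    return np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

policies = sample_policies(K)          # shape (K, S, A)
returns = rng.normal(size=K)           # placeholder for J(pi); in practice,
                                       # evaluate each policy in the MDP
log_w = BETA * returns                 # BPD prior: p(pi) ∝ exp(beta * J(pi))

def update(log_w, policies, state, action):
    """Bayesian update: multiply each weight by the likelihood pi(a|s)."""
    return log_w + np.log(policies[:, state, action])

def predict(log_w, policies, state):
    """Posterior-predictive action distribution in `state`."""
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return w @ policies[:, state, :]   # posterior-weighted policy mixture

# Observe a short episode and watch the predictions adapt to the "human".
for s, a in [(0, 1), (2, 1), (4, 1)]:
    print(predict(log_w, policies, s))
    log_w = update(log_w, policies, s, a)
```

Because the prediction is a posterior-weighted mixture over policies, consistently repeated deviations (here, always action 1) shift mass toward policies that reproduce them, which is the adaptation to systematic suboptimality described in the abstract.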
Related papers
- Inference-Time Policy Steering through Human Interactions [54.02655062969934] (2024-11-25T18:03:50Z)
During inference, humans are often removed from the policy execution loop.
We propose an Inference-Time Policy Steering framework that leverages human interactions to bias the generative sampling process.
Our proposed sampling strategy achieves the best trade-off between alignment and distribution shift.
- How Aligned are Generative Models to Humans in High-Stakes Decision-Making? [10.225573060836478] (2024-10-20T19:00:59Z)
Large generative models (LMs) are increasingly being considered for high-stakes decision-making.
This work compares such models with humans and with predictive AI models on the specific case of recidivism prediction.
- Ask Your Distribution Shift if Pre-Training is Right for You [74.18516460467019] (2024-02-29T23:46:28Z)
In practice, fine-tuning a pre-trained model improves robustness significantly in some cases but not at all in others.
We focus on two possible failure modes of models under distribution shift: poor extrapolation and biases in the training data.
Our study suggests that, as a rule of thumb, pre-training can help mitigate poor extrapolation but not dataset biases.
- Human Trajectory Forecasting with Explainable Behavioral Uncertainty [63.62824628085961] (2023-07-04T16:45:21Z)
Human trajectory forecasting helps to understand and predict human behaviors, enabling applications from social robots to self-driving cars.
Model-free methods offer superior prediction accuracy but lack explainability, while model-based methods provide explainability but predict less accurately.
The proposed BNSP-SFM model achieves up to a 50% improvement in prediction accuracy compared with 11 state-of-the-art methods.
- Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism [91.52263068880484] (2023-05-29T01:18:39Z)
We study offline Reinforcement Learning with Human Feedback (RLHF).
We aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices.
RLHF is challenging for several reasons: a large state space with limited human feedback, the bounded rationality of human decisions, and off-policy distribution shift.
- Optimal Behavior Prior: Data-Efficient Human Models for Improved Human-AI Collaboration [0.5524804393257919] (2022-11-03T06:10:22Z)
We show that using optimal behavior as a prior for human models makes these models vastly more data-efficient.
We also show that using these improved human models often leads to better human-AI collaboration performance.
- Probabilistic Modeling for Human Mesh Recovery [73.11532990173441] (2021-08-26T17:55:11Z)
This paper focuses on the problem of 3D human reconstruction from 2D evidence.
We recast the problem as learning a mapping from the input to a distribution of plausible 3D poses.
- On complementing end-to-end human motion predictors with planning [31.025766804649464] (2021-03-09T19:02:45Z)
High-capacity end-to-end approaches to human motion prediction can represent subtle nuances in human behavior but struggle with robustness to out-of-distribution inputs and tail events.
Planning-based prediction, on the other hand, can reliably output decent-but-not-great predictions.
- Double Robust Representation Learning for Counterfactual Prediction [68.78210173955001] (2020-10-15T16:39:26Z)
We propose a novel scalable method to learn double-robust representations for counterfactual prediction.
We make robust and efficient counterfactual predictions for both individual and average treatment effects.
The algorithm shows performance competitive with the state of the art on real-world and synthetic data.
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.