Choice between Partial Trajectories
- URL: http://arxiv.org/abs/2410.22690v1
- Date: Wed, 30 Oct 2024 04:52:22 GMT
- Title: Choice between Partial Trajectories
- Authors: Henrik Marklund, Benjamin Van Roy
- Abstract summary: It has been suggested that AI agents learn preferences from human choice data.
This approach requires a model of choice behavior that the agent can use to interpret the data.
We consider an alternative model based on the bootstrapped return, which adds to the partial return an estimate of the future return.
- Score: 19.39067577784909
- License:
- Abstract: As AI agents generate increasingly sophisticated behaviors, manually encoding human preferences to guide these agents becomes more challenging. To address this, it has been suggested that agents instead learn preferences from human choice data. This approach requires a model of choice behavior that the agent can use to interpret the data. For choices between partial trajectories of states and actions, previous models assume choice probabilities to be determined by the partial return or the cumulative advantage. We consider an alternative model based instead on the bootstrapped return, which adds to the partial return an estimate of the future return. Benefits of the bootstrapped return model stem from its treatment of human beliefs. Unlike partial return, choices based on bootstrapped return reflect human beliefs about the environment. Further, while recovering the reward function from choices based on cumulative advantage requires that those beliefs are correct, doing so from choices based on bootstrapped return does not. To motivate the bootstrapped return model, we formulate axioms and prove an Alignment Theorem. This result formalizes how, for a general class of human preferences, such models are able to disentangle goals from beliefs. This ensures recovery of an aligned reward function when learning from choices based on bootstrapped return. The bootstrapped return model also affords greater robustness to choice behavior. Even when choices are based on partial return, learning via a bootstrapped return model recovers an aligned reward function. The same holds with choices based on the cumulative advantage if the human and the agent both adhere to correct and consistent beliefs about the environment. On the other hand, if choices are based on bootstrapped return, learning via partial return or cumulative advantage models does not generally produce an aligned reward function.
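To make the contrast between these choice models concrete, here is a minimal sketch (not from the paper) of logit choice between two partial trajectories: under the partial return model the choice score is only the reward accrued within the trajectory, while under the bootstrapped return model it also includes the chooser's value estimate at the final state. The trajectory structure, the temperature parameter, and the `value_estimate` function are illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PartialTrajectory:
    steps: List[Tuple[str, str, float]]  # (state, action, reward) triples
    final_state: str                     # state where the partial trajectory ends

def partial_return(traj: PartialTrajectory) -> float:
    """Reward accrued within the partial trajectory."""
    return sum(reward for _, _, reward in traj.steps)

def bootstrapped_return(traj: PartialTrajectory,
                        value_estimate: Callable[[str], float]) -> float:
    """Partial return plus an estimate of the return to come after the final
    state, so the score reflects beliefs about the environment."""
    return partial_return(traj) + value_estimate(traj.final_state)

def choice_probability(score_1: float, score_2: float,
                       temperature: float = 1.0) -> float:
    """Logistic (Boltzmann) probability of choosing trajectory 1 over trajectory 2."""
    return 1.0 / (1.0 + math.exp(-(score_1 - score_2) / temperature))

# Two trajectories with equal partial return but different end states.
value_estimate = {"near_goal": 10.0, "dead_end": 0.0}.get
t1 = PartialTrajectory(steps=[("s0", "a", 1.0)], final_state="near_goal")
t2 = PartialTrajectory(steps=[("s0", "b", 1.0)], final_state="dead_end")
p_partial = choice_probability(partial_return(t1), partial_return(t2))  # 0.5
p_boot = choice_probability(bootstrapped_return(t1, value_estimate),
                            bootstrapped_return(t2, value_estimate))    # ~1.0
```

In this toy example the partial return model cannot distinguish the two trajectories, while the bootstrapped return model prefers the one whose end state the chooser believes leads to higher future return.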
Related papers
- Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Language model (LM) post-training involves maximizing a reward function that is derived from preference annotations.
DPO is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning.
We analyze where this approach falls short and propose distillation to obtain a better proxy for the true preference distribution over generation pairs.
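For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO objective for a single preference pair, written in terms of summed token log-probabilities under the policy and the frozen reference model; it does not reproduce the distillation variant proposed in this paper, and `beta` is the usual temperature-like hyperparameter.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair:
    -log sigmoid(beta * implicit-reward margin)."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # softplus(-margin) equals -log(sigmoid(margin)); guard against overflow.
    return math.log1p(math.exp(-margin)) if margin > -30.0 else -margin
```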
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
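As a rough illustration of how such a benchmark is scored (this is not the RewardBench code; `reward_model` is a hypothetical callable), a reward model can be evaluated by the fraction of trios in which it assigns the chosen response a higher score than the rejected one:

```python
from typing import Callable, Iterable, Tuple

def pairwise_accuracy(reward_model: Callable[[str, str], float],
                      trios: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of (prompt, chosen, rejected) trios for which the reward model
    scores the chosen response above the rejected one."""
    trios = list(trios)
    correct = sum(reward_model(prompt, chosen) > reward_model(prompt, rejected)
                  for prompt, chosen, rejected in trios)
    return correct / len(trios)
```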
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- What if you said that differently?: How Explanation Formats Affect Human Feedback Efficacy and User Perception [53.4840989321394]
We analyze the effect of rationales generated by QA models to support their answers.
We present users with incorrect answers and corresponding rationales in various formats.
We measure the effectiveness of this feedback in patching these rationales through in-context learning.
arXiv Detail & Related papers (2023-11-16T04:26:32Z)
- Model Agnostic Explainable Selective Regression via Uncertainty Estimation [15.331332191290727]
This paper presents a novel approach to selective regression that utilizes model-agnostic non-parametric uncertainty estimation.
Our proposed framework showcases superior performance compared to state-of-the-art selective regressors.
We implement our selective regression method in the open-source Python package doubt and release the code used to reproduce our experiments.
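As a generic illustration of selective regression driven by model-agnostic uncertainty (a sketch built on a bootstrap ensemble with scikit-learn; it is not this paper's method and does not use the doubt package API): train several regressors on resampled data, treat their disagreement as the uncertainty estimate, and abstain when it exceeds a threshold.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.utils import resample

def fit_bootstrap_ensemble(X, y, n_models=10, seed=0):
    """Train one regressor per bootstrap resample of the training data."""
    rng = np.random.RandomState(seed)
    models = []
    for _ in range(n_models):
        X_boot, y_boot = resample(X, y, random_state=rng)
        models.append(GradientBoostingRegressor(random_state=0).fit(X_boot, y_boot))
    return models

def selective_predict(models, X, threshold):
    """Return ensemble-mean predictions and a mask of accepted (low-uncertainty) inputs."""
    preds = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    mean, spread = preds.mean(axis=0), preds.std(axis=0)
    accept = spread <= threshold                      # abstain where the ensemble disagrees
    return mean, accept
```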
arXiv Detail & Related papers (2023-11-15T17:40:48Z)
- Learning Optimal Advantage from Preferences and Mistaking it for Reward [43.58066500250688]
Most recent work assumes that human preferences over trajectory segments are generated based only upon the reward accrued within those segments, i.e., their partial return.
We investigate the consequences of assuming preferences are based upon partial return when they actually arise from regret.
This paper overall provides insight regarding why learning under the partial return preference model tends to work so well in practice, despite it conforming poorly to how humans give preferences.
arXiv Detail & Related papers (2023-10-03T21:58:24Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important for solving sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
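The quantile-regression machinery that such an algorithm builds on can be sketched with the standard pinball loss over fixed quantile levels; this is not the EQR algorithm itself, and `target` stands in for a sampled return or Bellman target.

```python
import numpy as np

def quantile_regression_loss(value_quantiles: np.ndarray, target: float,
                             taus: np.ndarray) -> float:
    """Pinball loss for fitting value quantiles at levels `taus` to a scalar target:
    underestimates are weighted by tau, overestimates by (1 - tau)."""
    diff = target - np.asarray(value_quantiles)  # positive where a quantile underestimates
    return float(np.mean(np.where(diff > 0, taus * diff, (taus - 1.0) * diff)))

# Example: three quantile estimates evaluated against one sampled target.
loss = quantile_regression_loss(np.array([0.8, 1.0, 1.4]), target=1.2,
                                taus=np.array([0.1, 0.5, 0.9]))
```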
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach [45.83200636718999]
A major challenge in reinforcement learning is determining which state-action pairs are responsible for delayed future rewards.
We propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution.
Experimental results show that our method outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-05-28T21:51:38Z)
- Models of human preference for learning reward functions [80.39289349661364]
We learn the reward function from human-generated preferences between pairs of trajectory segments, which prior work assumes depend only on each segment's partial return.
We find this assumption to be flawed and propose modeling human preferences as informed by each segment's regret.
Our proposed regret preference model better predicts real human preferences and learns reward functions from these preferences that lead to policies better aligned with humans.
arXiv Detail & Related papers (2022-06-05T17:58:02Z)
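A rough sketch of the regret-style preference probability discussed in the entry above (assuming access to optimal value functions Q* and V*; not the authors' exact formulation): a segment's summed optimal advantage is, roughly, the negation of its regret, and the segment with lower regret is preferred with logistic probability.

```python
import math
from typing import Callable, List, Tuple

def cumulative_advantage(segment: List[Tuple[str, str, float]],
                         q_star: Callable[[str, str], float],
                         v_star: Callable[[str], float]) -> float:
    """Sum of optimal advantages A*(s, a) = Q*(s, a) - V*(s) over a segment."""
    return sum(q_star(s, a) - v_star(s) for s, a, _ in segment)

def regret_preference_probability(seg_1, seg_2, q_star, v_star,
                                  temperature: float = 1.0) -> float:
    """Logistic preference for seg_1 over seg_2: the segment with higher
    cumulative advantage (lower regret) is more likely to be preferred."""
    diff = (cumulative_advantage(seg_1, q_star, v_star)
            - cumulative_advantage(seg_2, q_star, v_star))
    return 1.0 / (1.0 + math.exp(-diff / temperature))
```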