Reward Learning as Doubly Nonparametric Bandits: Optimal Design and
Scaling Laws
- URL: http://arxiv.org/abs/2302.12349v1
- Date: Thu, 23 Feb 2023 22:07:33 GMT
- Title: Reward Learning as Doubly Nonparametric Bandits: Optimal Design and
Scaling Laws
- Authors: Kush Bhatia, Wenshuo Guo, Jacob Steinhardt
- Abstract summary: We propose a theoretical framework for studying reward learning and the associated optimal experiment design problem.
We first derive non-asymptotic excess risk bounds for a simple plug-in estimator based on ridge regression.
We then solve the query design problem by optimizing these risk bounds with respect to the choice of query set and obtain a finite sample statistical rate.
- Score: 22.099915149343957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Specifying reward functions for complex tasks like object manipulation or
driving is challenging to do by hand. Reward learning seeks to address this by
learning a reward model using human feedback on selected query policies. This
shifts the burden of reward specification to the optimal design of the queries.
We propose a theoretical framework for studying reward learning and the
associated optimal experiment design problem. Our framework models rewards and
policies as nonparametric functions belonging to subsets of Reproducing Kernel
Hilbert Spaces (RKHSs). The learner receives (noisy) oracle access to a true
reward and must output a policy that performs well under the true reward. For
this setting, we first derive non-asymptotic excess risk bounds for a simple
plug-in estimator based on ridge regression. We then solve the query design
problem by optimizing these risk bounds with respect to the choice of query set
and obtain a finite sample statistical rate, which depends primarily on the
eigenvalue spectrum of a certain linear operator on the RKHSs. Despite the
generality of these results, our bounds are stronger than previous bounds
developed for more specialized problems. We specifically show that the
well-studied problem of Gaussian process (GP) bandit optimization is a special
case of our framework, and that our bounds either improve or are competitive
with known regret guarantees for the Matérn kernel.
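To make the plug-in estimator concrete, below is a minimal Python sketch of kernel ridge regression over queried policies followed by a plug-in policy choice. The RBF kernel, the 1-D policy parameterization, and all names (`rbf_kernel`, `plug_in_reward_estimate`, `ridge`) are illustrative assumptions, not the paper's construction; the paper works with general RKHSs and ties the choice of query set to the eigenvalue spectrum of an associated linear operator.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=0.5):
    """Gram matrix of an RBF kernel between rows of X and Y (illustrative choice)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

def plug_in_reward_estimate(X_query, y_noisy, X_candidates, ridge=1e-2):
    """Kernel ridge regression reward model evaluated at candidate policies.

    X_query      : (n, d) features of the queried policies
    y_noisy      : (n,)   noisy oracle rewards for the queries
    X_candidates : (m, d) features of candidate policies to score
    ridge        : regularization strength (lambda)
    """
    K = rbf_kernel(X_query, X_query)
    alpha = np.linalg.solve(K + ridge * np.eye(len(X_query)), y_noisy)
    return rbf_kernel(X_candidates, X_query) @ alpha

# Toy example: the true reward is a smooth function of a 1-D policy parameter.
rng = np.random.default_rng(0)
true_reward = lambda x: np.sin(3 * x).ravel()
X_query = rng.uniform(0, 2, size=(20, 1))          # queried policies
y_noisy = true_reward(X_query) + 0.1 * rng.standard_normal(20)
X_cand = np.linspace(0, 2, 200).reshape(-1, 1)     # candidate policies
scores = plug_in_reward_estimate(X_query, y_noisy, X_cand)
best = X_cand[np.argmax(scores)]                   # plug-in policy choice
print("estimated best policy parameter:", best)
```

In this toy setup the open question is which `X_query` to pick; the paper's optimal design results choose the query set by optimizing the derived risk bounds.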
Related papers
- Exploration in Model-based Reinforcement Learning with Randomized Reward [40.87376174638752]
We show that, under the kernelized linear regulator (KNR) model, reward randomization guarantees partial optimism.
We further extend our theory to generalized function approximation and identify conditions under which reward randomization attains provably efficient exploration.
arXiv Detail & Related papers (2023-01-09T01:50:55Z)
- Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- On Kernelized Multi-Armed Bandits with Constraints [16.102401271318012]
We study a bandit problem with a general unknown reward function and a general unknown constraint function.
We propose a general framework for both algorithm design and performance analysis.
We demonstrate the superior performance of our proposed algorithms via numerical experiments.
arXiv Detail & Related papers (2022-03-29T14:02:03Z)
- Meta-Learning Hypothesis Spaces for Sequential Decision-making [79.73213540203389]
We propose to meta-learn a kernel from offline data (Meta-KeL).
Under mild conditions, we guarantee that our estimated RKHS yields valid confidence sets.
We also empirically evaluate the effectiveness of our approach on a Bayesian optimization task.
arXiv Detail & Related papers (2022-02-01T17:46:51Z)
- Risk-Aware Algorithms for Combinatorial Semi-Bandits [7.716156977428555]
We study the multi-armed bandit problem under semi-bandit feedback.
We consider the problem of maximizing the Conditional Value-at-Risk (CVaR), a risk measure that takes into account only the worst-case rewards (a brief numerical sketch appears after this list).
We propose new algorithms that maximize the CVaR of the rewards obtained from the super arms of the bandit.
arXiv Detail & Related papers (2021-12-02T11:29:43Z)
- Learning Long-Term Reward Redistribution via Randomized Return Decomposition [18.47810850195995]
We consider the problem formulation of episodic reinforcement learning with trajectory feedback.
This setting features an extreme delay of reward signals: the agent can only obtain one reward signal at the end of each trajectory.
We propose a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning.
arXiv Detail & Related papers (2021-11-26T13:23:36Z)
- B-Pref: Benchmarking Preference-Based Reinforcement Learning [84.41494283081326]
We introduce B-Pref, a benchmark specially designed for preference-based RL.
A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly.
B-Pref alleviates this by simulating teachers with a wide array of irrationalities.
arXiv Detail & Related papers (2021-11-04T17:32:06Z)
- Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL (Information Directed Reward Learning) naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z)
- Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde{O}(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards.
To the best of our knowledge, these are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z)
- Constrained Upper Confidence Reinforcement Learning [12.919486518128734]
This paper extends upper confidence reinforcement learning to settings in which the reward function and the constraints, described by cost functions, are unknown a priori.
We show that the algorithm C-UCRL achieves sub-linear regret with respect to the reward while satisfying the constraints, even while learning, with probability $1-\delta$.
arXiv Detail & Related papers (2020-01-26T00:23:02Z)
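As referenced in the Risk-Aware Algorithms for Combinatorial Semi-Bandits entry above, here is a brief, hypothetical Python sketch of the empirical CVaR that such risk-aware methods maximize: at level alpha it averages only the worst alpha-fraction of observed rewards. The function name `empirical_cvar` and the chosen level are illustrative, not taken from that paper.

```python
import numpy as np

def empirical_cvar(rewards, alpha=0.1):
    """Average of the worst alpha-fraction of rewards (lower tail, higher reward = better)."""
    rewards = np.sort(np.asarray(rewards, dtype=float))
    k = max(1, int(np.ceil(alpha * len(rewards))))
    return rewards[:k].mean()

rewards = np.random.default_rng(1).normal(loc=1.0, scale=0.5, size=1000)
print("mean reward:", rewards.mean())
print("CVaR_0.1   :", empirical_cvar(rewards, alpha=0.1))  # focuses on the worst outcomes
```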