Reward Learning as Doubly Nonparametric Bandits: Optimal Design and
Scaling Laws
- URL: http://arxiv.org/abs/2302.12349v1
- Date: Thu, 23 Feb 2023 22:07:33 GMT
- Title: Reward Learning as Doubly Nonparametric Bandits: Optimal Design and
Scaling Laws
- Authors: Kush Bhatia, Wenshuo Guo, Jacob Steinhardt
- Abstract summary: We propose a theoretical framework for studying reward learning and the associated optimal experiment design problem.
We first derive non-asymptotic excess risk bounds for a simple plug-in estimator based on ridge regression.
We then solve the query design problem by optimizing these risk bounds with respect to the choice of query set and obtain a finite sample statistical rate.
- Score: 22.099915149343957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Specifying reward functions for complex tasks like object manipulation or
driving is challenging to do by hand. Reward learning seeks to address this by
learning a reward model using human feedback on selected query policies. This
shifts the burden of reward specification to the optimal design of the queries.
We propose a theoretical framework for studying reward learning and the
associated optimal experiment design problem. Our framework models rewards and
policies as nonparametric functions belonging to subsets of Reproducing Kernel
Hilbert Spaces (RKHSs). The learner receives (noisy) oracle access to a true
reward and must output a policy that performs well under the true reward. For
this setting, we first derive non-asymptotic excess risk bounds for a simple
plug-in estimator based on ridge regression. We then solve the query design
problem by optimizing these risk bounds with respect to the choice of query set
and obtain a finite sample statistical rate, which depends primarily on the
eigenvalue spectrum of a certain linear operator on the RKHSs. Despite the
generality of these results, our bounds are stronger than previous bounds
developed for more specialized problems. We specifically show that the
well-studied problem of Gaussian process (GP) bandit optimization is a special
case of our framework, and that our bounds either improve or are competitive
with known regret guarantees for the Matérn kernel.
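To make the plug-in estimator concrete, below is a minimal Python sketch of kernel ridge regression over queried policies followed by a plug-in policy choice. The RBF kernel, the 1-D policy parameterization, and all names (`rbf_kernel`, `plug_in_reward_estimate`, `ridge`) are illustrative assumptions, not the paper's construction; the paper works with general RKHSs and ties the choice of query set to the eigenvalue spectrum of an associated linear operator.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=0.5):
    """Gram matrix of an RBF kernel between rows of X and Y (illustrative choice)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

def plug_in_reward_estimate(X_query, y_noisy, X_candidates, ridge=1e-2):
    """Kernel ridge regression reward model evaluated at candidate policies.

    X_query      : (n, d) features of the queried policies
    y_noisy      : (n,)   noisy oracle rewards for the queries
    X_candidates : (m, d) features of candidate policies to score
    ridge        : regularization strength (lambda)
    """
    K = rbf_kernel(X_query, X_query)
    alpha = np.linalg.solve(K + ridge * np.eye(len(X_query)), y_noisy)
    return rbf_kernel(X_candidates, X_query) @ alpha

# Toy example: the true reward is a smooth function of a 1-D policy parameter.
rng = np.random.default_rng(0)
true_reward = lambda x: np.sin(3 * x).ravel()
X_query = rng.uniform(0, 2, size=(20, 1))          # queried policies
y_noisy = true_reward(X_query) + 0.1 * rng.standard_normal(20)
X_cand = np.linspace(0, 2, 200).reshape(-1, 1)     # candidate policies
scores = plug_in_reward_estimate(X_query, y_noisy, X_cand)
best = X_cand[np.argmax(scores)]                   # plug-in policy choice
print("estimated best policy parameter:", best)
```

In this toy setup the open question is which `X_query` to pick; the paper's optimal design results choose the query set by optimizing the derived risk bounds.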
Related papers
- Exploration in Model-based Reinforcement Learning with Randomized Reward [40.87376174638752]
We show that, under the kernelized linear regulator (KNR) model, reward randomization guarantees partial optimism.
We further extend our theory to generalized function approximation and identify conditions under which reward randomization attains provably efficient exploration.
arXiv Detail & Related papers (2023-01-09T01:50:55Z)
- Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- On Kernelized Multi-Armed Bandits with Constraints [16.102401271318012]
We study a bandit problem with a general unknown reward function and a general unknown constraint function.
We propose a general framework for both algorithm design and performance analysis.
We demonstrate the superior performance of our proposed algorithms via numerical experiments.
arXiv Detail & Related papers (2022-03-29T14:02:03Z)
- Meta-Learning Hypothesis Spaces for Sequential Decision-making [79.73213540203389]
We propose to meta-learn a kernel from offline data (Meta-KeL).
Under mild conditions, we guarantee that our estimated RKHS yields valid confidence sets.
We also empirically evaluate the effectiveness of our approach on a Bayesian optimization task.
arXiv Detail & Related papers (2022-02-01T17:46:51Z)
- Risk-Aware Algorithms for Combinatorial Semi-Bandits [7.716156977428555]
We study the multi-armed bandit problem under semi-bandit feedback.
We consider the problem of maximizing the Conditional Value-at-Risk (CVaR), a risk measure that takes into account only the worst-case rewards (a brief numerical sketch appears after this list).
We propose new algorithms that maximize the CVaR of the rewards obtained from the super arms of the bandit.
arXiv Detail & Related papers (2021-12-02T11:29:43Z)
- Learning Long-Term Reward Redistribution via Randomized Return Decomposition [18.47810850195995]
We consider the problem formulation of episodic reinforcement learning with trajectory feedback.
This setting features an extreme delay of reward signals: the agent can only obtain one reward signal at the end of each trajectory.
We propose a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning.
arXiv Detail & Related papers (2021-11-26T13:23:36Z)
- B-Pref: Benchmarking Preference-Based Reinforcement Learning [84.41494283081326]
We introduce B-Pref, a benchmark specially designed for preference-based RL.
A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly.
B-Pref alleviates this by simulating teachers with a wide array of irrationalities.
arXiv Detail & Related papers (2021-11-04T17:32:06Z)
- Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL (Information Directed Reward Learning) naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z)
- Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde{O}(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards.
To the best of our knowledge, these are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z)
- Constrained Upper Confidence Reinforcement Learning [12.919486518128734]
This paper extends upper confidence reinforcement learning to settings in which the reward function and the constraints, described by cost functions, are unknown a priori.
We show that the algorithm C-UCRL achieves sub-linear regret with respect to the reward while satisfying the constraints, even while learning, with probability $1-\delta$.
arXiv Detail & Related papers (2020-01-26T00:23:02Z)
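As referenced in the Risk-Aware Algorithms for Combinatorial Semi-Bandits entry above, here is a brief, hypothetical Python sketch of the empirical CVaR that such risk-aware methods maximize: at level alpha it averages only the worst alpha-fraction of observed rewards. The function name `empirical_cvar` and the chosen level are illustrative, not taken from that paper.

```python
import numpy as np

def empirical_cvar(rewards, alpha=0.1):
    """Average of the worst alpha-fraction of rewards (lower tail, higher reward = better)."""
    rewards = np.sort(np.asarray(rewards, dtype=float))
    k = max(1, int(np.ceil(alpha * len(rewards))))
    return rewards[:k].mean()

rewards = np.random.default_rng(1).normal(loc=1.0, scale=0.5, size=1000)
print("mean reward:", rewards.mean())
print("CVaR_0.1   :", empirical_cvar(rewards, alpha=0.1))  # focuses on the worst outcomes
```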