Bellman Residual Orthogonalization for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2203.12786v1
- Date: Thu, 24 Mar 2022 01:04:17 GMT
- Title: Bellman Residual Orthogonalization for Offline Reinforcement Learning
- Authors: Andrea Zanette and Martin J. Wainwright
- Abstract summary: We introduce a new reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a test function space.
We exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class.
- Score: 53.17258888552998
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new reinforcement learning principle that approximates the
Bellman equations by enforcing their validity only along a user-defined space
of test functions. Focusing on applications to model-free offline RL with
function approximation, we exploit this principle to derive confidence
intervals for off-policy evaluation, as well as to optimize over policies
within a prescribed policy class. We prove an oracle inequality on our policy
optimization procedure in terms of a trade-off between the value and
uncertainty of an arbitrary comparator policy. Different choices of test
function spaces allow us to tackle different problems within a common
framework. We characterize the loss of efficiency in moving from on-policy to
off-policy data using our procedures, and establish connections to
concentrability coefficients studied in past work. We examine in depth the
implementation of our methods with linear function approximation, and provide
theoretical guarantees with polynomial-time implementations even when Bellman
closure does not hold.
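As an illustration of the principle, consider the linear special case in which Q-functions are modeled as theta^T phi(s, a) and the test functions are taken to be the feature maps themselves. Requiring the empirical Bellman residual to be orthogonal to every such test function then reduces to an LSTD-style linear system. The sketch below shows only this reduced case for off-policy evaluation; it is not the authors' estimator or their confidence-interval construction, and the function name, ridge term, and data layout are assumptions made for the example.

```python
import numpy as np

def ope_bellman_orthogonal(phi_sa, rewards, phi_next_pi, gamma=0.99, ridge=1e-3):
    """Off-policy evaluation sketch: choose theta so that the empirical Bellman
    residual of Q(s, a) = theta @ phi(s, a) is orthogonal to every test function
    in span{phi}.  Illustrative only; not the estimator from the paper.

    phi_sa      : (n, d) features of the logged (state, action) pairs
    rewards     : (n,)   observed rewards
    phi_next_pi : (n, d) features of (next_state, a'), with a' drawn from the target policy
    """
    n, d = phi_sa.shape
    # Orthogonality conditions:
    #   (1/n) * sum_i phi_i (phi_i - gamma * phi'_i)^T theta = (1/n) * sum_i phi_i r_i
    A = phi_sa.T @ (phi_sa - gamma * phi_next_pi) / n
    b = phi_sa.T @ rewards / n
    # Small ridge term added purely for numerical stability of the solve.
    return np.linalg.solve(A + ridge * np.eye(d), b)

# Toy usage with synthetic data, just to show the expected shapes.
rng = np.random.default_rng(0)
n, d = 500, 8
theta_hat = ope_bellman_orthogonal(
    phi_sa=rng.normal(size=(n, d)),
    rewards=rng.normal(size=n),
    phi_next_pi=rng.normal(size=(n, d)),
)
print("estimated Q-function weights:", theta_hat)
```

Richer test-function spaces yield larger systems of moment conditions of the same form, which is what allows the single framework in the abstract to cover different problems.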
Related papers
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration; a minimal sketch of this reweighting idea appears after this list.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- On Imitation Learning of Linear Control Policies: Enforcing Stability and Robustness Constraints via LMI Conditions [3.296303220677533]
We formulate the imitation learning of linear policies as a constrained optimization problem.
We show that one can guarantee the closed-loop stability and robustness by posing linear matrix inequality (LMI) constraints on the fitted policy.
arXiv Detail & Related papers (2021-03-24T02:43:03Z)
- Provably Correct Optimization and Exploration with Non-linear Policies [65.60853260886516]
ENIAC is an actor-critic method that allows non-linear function approximation in the critic.
We show that under certain assumptions, the learner finds a near-optimal policy in $O(\mathrm{poly}(d))$ exploration rounds.
We empirically evaluate this adaptation and show that it outperforms prior approaches inspired by linear methods.
arXiv Detail & Related papers (2021-03-22T03:16:33Z)
- CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
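The Variance-Aware Off-Policy Evaluation entry above describes reweighting Bellman residuals inside Fitted Q-Iteration by an estimated variance. The snippet below is a minimal sketch of that general idea for a single iteration under linear function approximation, assuming per-sample variance estimates are already available; the function name and signature are hypothetical, and this is not the VA-OPE algorithm from the cited paper.

```python
import numpy as np

def variance_weighted_fqi_step(theta_prev, phi_sa, rewards, phi_next, var_hat,
                               gamma=0.99, ridge=1e-3):
    """One Fitted Q-Iteration style regression in which each sample's Bellman
    residual is down-weighted by an estimated target variance.  Generic
    illustration only; all names here are hypothetical.

    theta_prev : (d,)   current linear Q-function weights
    phi_sa     : (n, d) features of the logged (state, action) pairs
    rewards    : (n,)   observed rewards
    phi_next   : (n, d) features of the next state under the evaluated policy's action
    var_hat    : (n,)   per-sample variance estimates for the regression targets
    """
    n, d = phi_sa.shape
    targets = rewards + gamma * phi_next @ theta_prev      # one-step Bellman backup
    weights = 1.0 / np.maximum(var_hat, 1e-6)              # inverse-variance weights
    # Weighted ridge regression: noisier targets influence the new fit less.
    A = phi_sa.T @ (weights[:, None] * phi_sa) + ridge * np.eye(d)
    b = phi_sa.T @ (weights * targets)
    return np.linalg.solve(A, b)
```

Iterating this step, with the weights refreshed from updated variance estimates, is one natural way to realize the reweighting idea described in that abstract.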