Bellman-consistent Pessimism for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2106.06926v6
- Date: Mon, 23 Oct 2023 23:46:15 GMT
- Title: Bellman-consistent Pessimism for Offline Reinforcement Learning
- Authors: Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, Alekh Agarwal
- Abstract summary: We introduce the notion of Bellman-consistent pessimism for general function approximation.
Our theoretical guarantees only require Bellman closedness as standard in the exploratory setting.
- Score: 46.97637726255375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of pessimism, when reasoning about datasets lacking exhaustive
exploration, has recently gained prominence in offline reinforcement learning.
Despite the robustness it adds to the algorithm, overly pessimistic reasoning
can be equally damaging in precluding the discovery of good policies, which is
an issue for the popular bonus-based pessimism. In this paper, we introduce the
notion of Bellman-consistent pessimism for general function approximation:
instead of calculating a point-wise lower bound for the value function, we
implement pessimism at the initial state over the set of functions consistent
with the Bellman equations. Our theoretical guarantees only require Bellman
closedness as standard in the exploratory setting, in which case bonus-based
pessimism fails to provide guarantees. Even in the special case of linear
function approximation where stronger expressivity assumptions hold, our result
improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample
complexity when the action space is finite. Remarkably, our algorithms
automatically adapt to the best bias-variance tradeoff in hindsight,
whereas most prior approaches require tuning extra hyperparameters a priori.
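To make the idea concrete, below is a minimal, hypothetical Python sketch of pessimism at the initial state over a Bellman-consistent set: for each candidate policy, keep only the Q-functions whose empirical Bellman residual under that policy is small, score the policy by the lowest initial-state value among them, and return the policy with the best pessimistic score. The finite Q-function and policy classes, the plain squared-residual threshold `eps`, and the brute-force enumeration are illustrative simplifications; in particular, the paper's estimator corrects for the double-sampling bias of squared Bellman errors and handles general function approximation.

```python
import numpy as np

def bellman_residual(q, pi, data, gamma=0.99):
    """Average squared empirical Bellman residual of Q-table `q` under policy `pi`.

    NOTE: this naive residual is only an illustration; a faithful estimator must
    correct for the double-sampling bias of squared Bellman errors.
    """
    errs = [(q[s, a] - (r + gamma * q[s2, pi[s2]])) ** 2 for (s, a, r, s2) in data]
    return float(np.mean(errs))

def bellman_consistent_pessimism(q_class, policies, data, s0, eps):
    """Return argmax over policies of min over the version space of q(s0, pi(s0))."""
    best_pi, best_val = None, -np.inf
    for pi in policies:
        # Version space: Q-functions (approximately) consistent with the
        # Bellman equations for this policy on the offline data.
        consistent = [q for q in q_class if bellman_residual(q, pi, data) <= eps]
        if not consistent:
            continue  # no candidate in the class explains the data for this policy
        # Pessimism at the initial state: worst value over the consistent set,
        # rather than a point-wise lower bound built from per-state bonuses.
        pessimistic_value = min(q[s0, pi[s0]] for q in consistent)
        if pessimistic_value > best_val:
            best_pi, best_val = pi, pessimistic_value
    return best_pi, best_val
```

A practical instantiation would replace the enumeration over policies with an actor-critic style update; the sketch is only meant to convey the max-over-policies of a min-over-consistent-value-functions structure.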
Related papers
- The Role of Inherent Bellman Error in Offline Reinforcement Learning with Linear Function Approximation [29.69428894587431]
In this paper, we study the offline RL problem with linear function approximation.
Our main structural assumption is that the MDP has low inherent Bellman error.
We show that the scaling of the suboptimality with $\sqrt{\varepsilon_{\mathrm{BE}}}$ cannot be improved for any algorithm.
arXiv Detail & Related papers (2024-06-17T16:04:06Z) - Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits [82.28442917447643]
We present the first general oracle-efficient algorithm for pessimistic OPO.
We obtain statistical guarantees analogous to those for prior pessimistic approaches.
We show an advantage over unregularized OPO across a wide range of configurations.
arXiv Detail & Related papers (2023-06-13T17:29:50Z) - Model-Based Uncertainty in Value Functions [89.31922008981735]
We focus on characterizing the variance over values induced by a distribution over MDPs.
Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation.
We propose a new uncertainty Bellman equation whose solution converges to the true posterior variance over values (one common form of such a recursion is sketched after this list).
arXiv Detail & Related papers (2023-02-24T09:18:27Z) - When is Realizability Sufficient for Off-Policy Reinforcement Learning? [17.317841035807696]
We analyze the statistical complexity of off-policy reinforcement learning when only realizability holds for the prescribed function class.
We establish finite-sample guarantees for off-policy reinforcement learning that are free of the approximation error term known as inherent Bellman error.
arXiv Detail & Related papers (2022-11-10T03:15:31Z) - Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach [6.7826352751791985]
We propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting.
We integrate the pessimism principle with Thompson sampling and Bayesian machine learning for optimizing the degree of pessimism.
We develop the computational algorithm based on variational inference, which is highly efficient and scalable.
arXiv Detail & Related papers (2022-10-26T02:14:10Z) - Efficient and Differentiable Conformal Prediction with General Function
Classes [96.74055810115456]
We propose a generalization of conformal prediction to multiple learnable parameters.
We show that it achieves approximate valid population coverage and near-optimal efficiency within class.
Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly.
arXiv Detail & Related papers (2022-02-22T18:37:23Z) - Analysis and Optimisation of Bellman Residual Errors with Neural Function Approximation [0.0]
Recent developments in Deep Reinforcement Learning have demonstrated the superior performance of neural networks in solving challenging problems with large or even continuous state spaces.
One specific approach is to deploy neural networks to approximate the value function by minimising the Mean Squared Bellman Error (a minimal sketch of this objective appears after this list).
arXiv Detail & Related papers (2021-06-16T13:35:14Z) - Bayesian Bellman Operators [55.959376449737405]
We introduce a novel perspective on Bayesian reinforcement learning (RL).
Our framework is motivated by the insight that when bootstrapping is introduced, model-free approaches actually infer a posterior over Bellman operators, not value functions.
arXiv Detail & Related papers (2021-06-09T12:20:46Z) - Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how value iteration under the more standard notion of low inherent Bellman error, typically employed in least-squares value iteration-style algorithms, can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near-optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
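For the uncertainty Bellman equation mentioned in "Model-Based Uncertainty in Value Functions" above, the LaTeX snippet below gives one common form of such a recursion, following the earlier upper-bound-style work that paper builds on; the local-uncertainty term $u(s,a)$ and the posterior-mean transition model $\bar{P}$ are notational assumptions here, and this is not the paper's new equation.

```latex
% A generic uncertainty Bellman recursion: the epistemic uncertainty U_pi at (s,a)
% is a local term u(s,a) (posterior variance of the model at (s,a)) plus the
% discounted, averaged uncertainty at successor state-action pairs.
\[
  U_\pi(s,a) \;=\; u(s,a) \;+\; \gamma^2 \sum_{s'} \bar{P}(s' \mid s,a)
  \sum_{a'} \pi(a' \mid s')\, U_\pi(s',a').
\]
% Earlier work shows the solution of such a recursion upper bounds the posterior
% variance of the values; the paper above proposes a modified equation whose
% solution matches the true posterior variance exactly.
```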
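For the Mean Squared Bellman Error objective named in "Analysis and Optimisation of Bellman Residual Errors with Neural Function Approximation", here is a minimal, generic sketch of MSBE minimisation with a small neural Q-network; the two-layer network, the semi-gradient (detached-target) choice, and all hyperparameters are illustrative assumptions rather than that paper's method.

```python
import torch
import torch.nn as nn

# Small Q-network: maps a state vector to one value per discrete action.
class QNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, s):
        return self.net(s)

def msbe_loss(q, batch, gamma=0.99):
    """Mean Squared Bellman Error on a batch of (s, a, r, s') transitions.

    The bootstrap target is detached (semi-gradient), a common practical choice;
    residual-gradient variants differentiate through the target as well.
    """
    s, a, r, s_next = batch
    q_sa = q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * q(s_next).max(dim=1).values  # greedy bootstrap
    return ((q_sa - target) ** 2).mean()

# Illustrative training step on random synthetic transitions.
q = QNet(state_dim=4, n_actions=2)
opt = torch.optim.Adam(q.parameters(), lr=1e-3)
batch = (torch.randn(32, 4), torch.randint(0, 2, (32,)),
         torch.randn(32), torch.randn(32, 4))
loss = msbe_loss(q, batch)
opt.zero_grad()
loss.backward()
opt.step()
print(f"MSBE on batch: {loss.item():.4f}")
```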