Bellman-consistent Pessimism for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2106.06926v6
- Date: Mon, 23 Oct 2023 23:46:15 GMT
- Title: Bellman-consistent Pessimism for Offline Reinforcement Learning
- Authors: Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, Alekh Agarwal
- Abstract summary: We introduce the notion of Bellman-consistent pessimism for general function approximation.
Our theoretical guarantees only require Bellman closedness as standard in the exploratory setting.
- Score: 46.97637726255375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of pessimism, when reasoning about datasets lacking exhaustive
exploration, has recently gained prominence in offline reinforcement learning.
Despite the robustness it adds to the algorithm, overly pessimistic reasoning
can be equally damaging in precluding the discovery of good policies, which is
an issue for the popular bonus-based pessimism. In this paper, we introduce the
notion of Bellman-consistent pessimism for general function approximation:
instead of calculating a point-wise lower bound for the value function, we
implement pessimism at the initial state over the set of functions consistent
with the Bellman equations. Our theoretical guarantees only require Bellman
closedness as standard in the exploratory setting, in which case bonus-based
pessimism fails to provide guarantees. Even in the special case of linear
function approximation where stronger expressivity assumptions hold, our result
improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample
complexity when the action space is finite. Remarkably, our algorithms
automatically adapt to the best bias-variance tradeoff in hindsight,
whereas most prior approaches require tuning extra hyperparameters a priori.
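To make the idea concrete, below is a minimal, hypothetical Python sketch of pessimism at the initial state over a Bellman-consistent set: for each candidate policy, keep only the Q-functions whose empirical Bellman residual under that policy is small, score the policy by the lowest initial-state value among them, and return the policy with the best pessimistic score. The finite Q-function and policy classes, the plain squared-residual threshold `eps`, and the brute-force enumeration are illustrative simplifications; in particular, the paper's estimator corrects for the double-sampling bias of squared Bellman errors and handles general function approximation.

```python
import numpy as np

def bellman_residual(q, pi, data, gamma=0.99):
    """Average squared empirical Bellman residual of Q-table `q` under policy `pi`.

    NOTE: this naive residual is only an illustration; a faithful estimator must
    correct for the double-sampling bias of squared Bellman errors.
    """
    errs = [(q[s, a] - (r + gamma * q[s2, pi[s2]])) ** 2 for (s, a, r, s2) in data]
    return float(np.mean(errs))

def bellman_consistent_pessimism(q_class, policies, data, s0, eps):
    """Return argmax over policies of min over the version space of q(s0, pi(s0))."""
    best_pi, best_val = None, -np.inf
    for pi in policies:
        # Version space: Q-functions (approximately) consistent with the
        # Bellman equations for this policy on the offline data.
        consistent = [q for q in q_class if bellman_residual(q, pi, data) <= eps]
        if not consistent:
            continue  # no candidate in the class explains the data for this policy
        # Pessimism at the initial state: worst value over the consistent set,
        # rather than a point-wise lower bound built from per-state bonuses.
        pessimistic_value = min(q[s0, pi[s0]] for q in consistent)
        if pessimistic_value > best_val:
            best_pi, best_val = pi, pessimistic_value
    return best_pi, best_val
```

A practical instantiation would replace the enumeration over policies with an actor-critic style update; the sketch is only meant to convey the max-over-policies of a min-over-consistent-value-functions structure.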
Related papers
- The Role of Inherent Bellman Error in Offline Reinforcement Learning with Linear Function Approximation [29.69428894587431]
In this paper, we study the offline RL problem with linear function approximation.
Our main structural assumption is that the MDP has low inherent Bellman error.
We show that the scaling of the suboptimality with $\sqrt{\varepsilon_{\mathrm{BE}}}$ cannot be improved for any algorithm.
arXiv Detail & Related papers (2024-06-17T16:04:06Z) - Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits [82.28442917447643]
We present the first general oracle-efficient algorithm for pessimistic OPO.
We obtain statistical guarantees analogous to those for prior pessimistic approaches.
We show an advantage over unregularized OPO across a wide range of configurations.
arXiv Detail & Related papers (2023-06-13T17:29:50Z) - Model-Based Uncertainty in Value Functions [89.31922008981735]
We focus on characterizing the variance over values induced by a distribution over MDPs.
Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation.
We propose a new uncertainty Bellman equation whose solution converges to the true posterior variance over values (one common form of such a recursion is sketched after this list).
arXiv Detail & Related papers (2023-02-24T09:18:27Z) - When is Realizability Sufficient for Off-Policy Reinforcement Learning? [17.317841035807696]
We analyze the statistical complexity of off-policy reinforcement learning when only realizability holds for the prescribed function class.
We establish finite-sample guarantees for off-policy reinforcement learning that are free of the approximation error term known as inherent Bellman error.
arXiv Detail & Related papers (2022-11-10T03:15:31Z) - Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach [6.7826352751791985]
We propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting.
We integrate the pessimism principle with Thompson sampling and Bayesian machine learning for optimizing the degree of pessimism.
We develop the computational algorithm based on variational inference, which is highly efficient and scalable.
arXiv Detail & Related papers (2022-10-26T02:14:10Z) - Efficient and Differentiable Conformal Prediction with General Function
Classes [96.74055810115456]
We propose a generalization of conformal prediction to multiple learnable parameters.
We show that it achieves approximate valid population coverage and near-optimal efficiency within class.
Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly.
arXiv Detail & Related papers (2022-02-22T18:37:23Z) - Analysis and Optimisation of Bellman Residual Errors with Neural Function Approximation [0.0]
Recent developments in Deep Reinforcement Learning have demonstrated the superior performance of neural networks in solving challenging problems with large or even continuous state spaces.
One specific approach is to deploy neural networks to approximate the value function by minimising the Mean Squared Bellman Error (a minimal sketch of this objective appears after this list).
arXiv Detail & Related papers (2021-06-16T13:35:14Z) - Bayesian Bellman Operators [55.959376449737405]
We introduce a novel perspective on Bayesian reinforcement learning (RL).
Our framework is motivated by the insight that when bootstrapping is introduced, model-free approaches actually infer a posterior over Bellman operators, not value functions.
arXiv Detail & Related papers (2021-06-09T12:20:46Z) - Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how value iteration under the more standard notion of low inherent Bellman error, typically employed in least-squares value iteration-style algorithms, can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near-optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
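For the uncertainty Bellman equation mentioned in "Model-Based Uncertainty in Value Functions" above, the LaTeX snippet below gives one common form of such a recursion, following the earlier upper-bound-style work that paper builds on; the local-uncertainty term $u(s,a)$ and the posterior-mean transition model $\bar{P}$ are notational assumptions here, and this is not the paper's new equation.

```latex
% A generic uncertainty Bellman recursion: the epistemic uncertainty U_pi at (s,a)
% is a local term u(s,a) (posterior variance of the model at (s,a)) plus the
% discounted, averaged uncertainty at successor state-action pairs.
\[
  U_\pi(s,a) \;=\; u(s,a) \;+\; \gamma^2 \sum_{s'} \bar{P}(s' \mid s,a)
  \sum_{a'} \pi(a' \mid s')\, U_\pi(s',a').
\]
% Earlier work shows the solution of such a recursion upper bounds the posterior
% variance of the values; the paper above proposes a modified equation whose
% solution matches the true posterior variance exactly.
```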
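For the Mean Squared Bellman Error objective named in "Analysis and Optimisation of Bellman Residual Errors with Neural Function Approximation", here is a minimal, generic sketch of MSBE minimisation with a small neural Q-network; the two-layer network, the semi-gradient (detached-target) choice, and all hyperparameters are illustrative assumptions rather than that paper's method.

```python
import torch
import torch.nn as nn

# Small Q-network: maps a state vector to one value per discrete action.
class QNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, s):
        return self.net(s)

def msbe_loss(q, batch, gamma=0.99):
    """Mean Squared Bellman Error on a batch of (s, a, r, s') transitions.

    The bootstrap target is detached (semi-gradient), a common practical choice;
    residual-gradient variants differentiate through the target as well.
    """
    s, a, r, s_next = batch
    q_sa = q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * q(s_next).max(dim=1).values  # greedy bootstrap
    return ((q_sa - target) ** 2).mean()

# Illustrative training step on random synthetic transitions.
q = QNet(state_dim=4, n_actions=2)
opt = torch.optim.Adam(q.parameters(), lr=1e-3)
batch = (torch.randn(32, 4), torch.randint(0, 2, (32,)),
         torch.randn(32), torch.randn(32, 4))
loss = msbe_loss(q, batch)
opt.zero_grad()
loss.backward()
opt.step()
print(f"MSBE on batch: {loss.item():.4f}")
```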