Pessimistic Model-based Offline RL: PAC Bounds and Posterior Sampling
under Partial Coverage
- URL: http://arxiv.org/abs/2107.06226v1
- Date: Tue, 13 Jul 2021 16:30:01 GMT
- Title: Pessimistic Model-based Offline RL: PAC Bounds and Posterior Sampling
under Partial Coverage
- Authors: Masatoshi Uehara, Wen Sun
- Abstract summary: We study model-based offline Reinforcement Learning with general function approximation.
We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO) which leverages a general function class and uses a constraint to encode pessimism.
- Score: 33.766012922307084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study model-based offline Reinforcement Learning with general function
approximation. We present an algorithm named Constrained Pessimistic Policy
Optimization (CPPO) which leverages a general function class and uses a
constraint to encode pessimism. Under the assumption that the ground truth
model belongs to our function class, CPPO can learn with the offline data only
providing partial coverage, i.e., it can learn a policy that competes against
any policy that is covered by the offline data, in polynomial sample complexity
with respect to the statistical complexity of the function class. We then
demonstrate that this algorithmic framework can be applied to many specialized
Markov Decision Processes where the additional structural assumptions can
further refine the concept of partial coverage. One notable example is low-rank
MDP with representation learning where the partial coverage is defined using
the concept of relative condition number measured by the underlying unknown
ground truth feature representation. Finally, we introduce and study the
Bayesian setting in offline RL. The key benefit of Bayesian offline RL is that
algorithmically, we do not need to explicitly construct pessimism or reward
penalty which could be hard beyond models with linear structures. We present a
posterior sampling-based incremental policy optimization algorithm (PS-PO)
which proceeds by iteratively sampling a model from the posterior distribution
and performing one-step incremental policy optimization inside the sampled
model. Theoretically, in expectation with respect to the prior distribution,
PS-PO can learn a near optimal policy under partial coverage with polynomial
sample complexity.
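The abstract describes the two procedures only at a high level, so the following is a minimal pseudocode-style sketch of how the two ideas could be organized; every helper and candidate set below (models, policies, log_likelihood, value, slack, sample_posterior, improve) is an illustrative assumption, not notation or code from the paper.

```python
# Pseudocode-style sketch of the two ideas in the abstract. All helper callables
# and candidate sets are illustrative assumptions, not the paper's actual code.
from typing import Any, Callable, Sequence

def cppo_sketch(
    models: Sequence[Any],                   # candidate dynamics models (the function class)
    policies: Sequence[Any],                 # candidate policies to search over
    log_likelihood: Callable[[Any], float],  # log-likelihood of the offline data under a model
    value: Callable[[Any, Any], float],      # value of a policy evaluated inside a model
    slack: float,                            # statistical slack defining the constraint
) -> Any:
    """Constrained pessimism: maximize the worst-case value over models that fit the data."""
    best_ll = max(log_likelihood(m) for m in models)
    # The constraint encoding pessimism: keep only models nearly as likely as the best fit.
    version_space = [m for m in models if log_likelihood(m) >= best_ll - slack]
    # Max-min over the remaining plausible models (non-empty: it contains the best fit).
    return max(policies, key=lambda pi: min(value(m, pi) for m in version_space))

def ps_po_sketch(
    sample_posterior: Callable[[], Any],     # draw a model from the posterior given the data
    improve: Callable[[Any, Any], Any],      # one incremental policy-optimization step in a model
    init_policy: Any,
    num_iters: int,
) -> Any:
    """Posterior sampling: no explicit pessimism term; iterate sample-then-improve."""
    policy = init_policy
    for _ in range(num_iters):
        model = sample_posterior()           # sample a model from the posterior
        policy = improve(model, policy)      # e.g., a single incremental policy update
    return policy
```

In this reading, CPPO's pessimism comes from the max-min over a likelihood-constrained model set, while PS-PO avoids constructing that set explicitly by drawing a model from the posterior and taking one incremental improvement step per iteration.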
Related papers
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation, with stronger or similar performance compared to PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- A Theoretical Analysis of Optimistic Proximal Policy Optimization in Linear Markov Decision Processes [13.466249082564213]
We propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback.
Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both linear MDPs and adversarial linear MDPs with full information.
arXiv Detail & Related papers (2023-05-15T17:55:24Z)
- Revisiting the Linear-Programming Framework for Offline RL with General Function Approximation [24.577243536475233]
Offline reinforcement learning (RL) concerns pursuing an optimal policy for sequential decision-making from a pre-collected dataset.
Recent theoretical progress has focused on developing sample-efficient offline RL algorithms with various relaxed assumptions on data coverage and function approximators.
We revisit the linear-programming framework for offline RL, and advance the existing results in several aspects.
arXiv Detail & Related papers (2022-12-28T15:28:12Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- PAC Reinforcement Learning for Predictive State Representations [60.00237613646686]
We study online Reinforcement Learning (RL) in partially observable dynamical systems.
We focus on the Predictive State Representations (PSRs) model, which is an expressive model that captures other well-known models.
We develop a novel model-based algorithm for PSRs that can learn a near optimal policy with sample complexity scaling polynomially in the relevant parameters of the system.
arXiv Detail & Related papers (2022-07-12T17:57:17Z)
- Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
- COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-action pairs.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
- Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
arXiv Detail & Related papers (2020-12-30T09:06:57Z)
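The pessimistic value iteration idea in the last entry (PEVI) is concrete enough to sketch in a tabular, finite-horizon setting. This is a hedged illustration only: the estimated model (r_hat, P_hat), the uncertainty quantifier gamma_pen, and the horizon H are taken as given inputs and are not specified by the paper.

```python
# A minimal tabular sketch in the spirit of PEVI: backward induction where an
# uncertainty quantifier (gamma_pen) is subtracted from the Bellman backup as a penalty.
# The estimated rewards/transitions and the penalty itself are assumed to be given.
import numpy as np

def pessimistic_value_iteration(r_hat, P_hat, gamma_pen, H):
    """r_hat, gamma_pen: (H, S, A) arrays; P_hat: (H, S, A, S) estimated transition kernels."""
    _, S, A = r_hat.shape
    V = np.zeros((H + 1, S))                 # V[H] = 0 terminal values
    policy = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        # Penalized Bellman backup: estimated reward + expected next value - penalty.
        Q = r_hat[h] + P_hat[h] @ V[h + 1] - gamma_pen[h]
        Q = np.clip(Q, 0.0, H - h)           # keep Q in the feasible value range
        policy[h] = Q.argmax(axis=1)         # act greedily w.r.t. the pessimistic Q
        V[h] = Q.max(axis=1)
    return policy, V
```

The only moving part is the penalized backup: the uncertainty quantifier is subtracted from the estimated Bellman target before the greedy step, which is what makes the resulting policy pessimistic on poorly covered state-action pairs.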