Double Pessimism is Provably Efficient for Distributionally Robust
Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage
- URL: http://arxiv.org/abs/2305.09659v2
- Date: Tue, 22 Aug 2023 14:23:10 GMT
- Title: Double Pessimism is Provably Efficient for Distributionally Robust
Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage
- Authors: Jose Blanchet, Miao Lu, Tong Zhang, Han Zhong
- Abstract summary: We study robust offline reinforcement learning (robust offline RL).
We propose a generic algorithm framework called Doubly Pessimistic Model-based Policy Optimization ($P^2MPO$).
We show that $P^2MPO$ enjoys a $\tilde{\mathcal{O}}(n^{-1/2})$ convergence rate, where $n$ is the dataset size.
- Score: 15.858892479232656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study distributionally robust offline reinforcement
learning (robust offline RL), which seeks to find an optimal policy purely from
an offline dataset that can perform well in perturbed environments.
Specifically, we propose a generic algorithm framework called Doubly Pessimistic
Model-based Policy Optimization ($P^2MPO$), which features a novel combination
of a flexible model estimation subroutine and a doubly pessimistic policy
optimization step. Notably, the double pessimism principle is crucial to
overcome the distributional shifts incurred by (i) the mismatch between the
behavior policy and the target policies; and (ii) the perturbation of the
nominal model. Under certain accuracy conditions on the model estimation
subroutine, we prove that $P^2MPO$ is sample-efficient with robust partial
coverage data, which only requires the offline data to have good coverage of
the distributions induced by the optimal robust policy and the perturbed models
around the nominal model.
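Schematically (the notation below is ours, not taken verbatim from the paper), the doubly pessimistic policy optimization step can be summarized as a max-min-min problem:
$$ \hat{\pi} \in \arg\max_{\pi \in \Pi}\; \min_{P \in \widehat{\mathcal{M}}_{\mathcal{D}}}\; \min_{\widetilde{P} \in \mathcal{U}^{\rho}(P)}\; V^{\pi}_{\widetilde{P}}, $$
where $\widehat{\mathcal{M}}_{\mathcal{D}}$ is the confidence set of models returned by the model estimation subroutine on the dataset $\mathcal{D}$, $\mathcal{U}^{\rho}(P)$ is the uncertainty set of radius $\rho$ around a candidate nominal model $P$, and $V^{\pi}_{\widetilde{P}}$ is the value of policy $\pi$ under model $\widetilde{P}$. The first minimum is pessimism against the statistical error caused by distribution shift (i); the second is pessimism against the perturbation of the nominal model (ii).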
By tailoring specific model estimation subroutines for concrete examples of
RMDPs, including tabular RMDPs, factored RMDPs, kernel and neural RMDPs, we
prove that $P^2MPO$ enjoys a $\tilde{\mathcal{O}}(n^{-1/2})$ convergence rate,
where $n$ is the dataset size. We highlight that all these examples, except
tabular RMDPs, are first identified and proven tractable by this work.
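In schematic form (the coverage coefficient below is illustrative notation, not the paper's), such a guarantee reads
$$ \mathrm{SubOpt}(\hat{\pi}) \;=\; V^{\pi^{\star}}_{\mathrm{rob}} - V^{\hat{\pi}}_{\mathrm{rob}} \;\le\; \widetilde{\mathcal{O}}\big(\sqrt{C^{\mathrm{rob}}/n}\big), $$
where $V^{\pi}_{\mathrm{rob}}$ denotes the worst-case value of $\pi$ over the uncertainty set, $\pi^{\star}$ is the optimal robust policy, and $C^{\mathrm{rob}}$ is a robust partial coverage coefficient that stays finite as long as the offline data cover the distributions induced by $\pi^{\star}$ under models near the nominal one.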
Furthermore, we continue our study of robust offline RL in the robust Markov
games (RMGs). By extending the double pessimism principle identified for
single-agent RMDPs, we propose another algorithm framework that can efficiently
find the robust Nash equilibria among players using only robust unilateral
(partial) coverage data. To the best of our knowledge, this work proposes the first
general learning principle -- double pessimism -- for robust offline RL and
shows that it is provably efficient with general function approximation.
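To make the single-agent template concrete, the following is a toy, tabular sketch in the spirit of double pessimism; it is our own simplification, not the paper's algorithm. The two layers of pessimism are collapsed into a single worst-case transition search per state-action pair over a total-variation ball whose radius adds a count-based statistical bonus (pessimism against partial coverage) to the robustness level rho (pessimism against model perturbation).

```python
import numpy as np

def worst_case_expectation(p, v, radius):
    """Greedy solution of min_{q in simplex, TV(q, p) <= radius} q @ v:
    shift up to `radius` probability mass from the highest-value next
    states onto the single lowest-value next state."""
    q = p.copy()
    worst = int(np.argmin(v))
    budget = radius
    for s in np.argsort(v)[::-1]:          # take mass from high-value states first
        if s == worst or budget <= 0:
            continue
        move = min(q[s], budget)
        q[s] -= move
        q[worst] += move
        budget -= move
    return float(q @ v)

def doubly_pessimistic_vi(counts, rewards, rho, gamma=0.95, n_iters=300):
    """counts: [S, A, S'] transition counts from the offline dataset;
    rewards: [S, A]; rho: robustness radius (TV) around the nominal model."""
    S, A, _ = counts.shape
    n_sa = counts.sum(axis=-1)                                # visits per (s, a)
    # Empirical nominal model; unvisited (s, a) pairs fall back to uniform.
    p_hat = np.where(n_sa[..., None] > 0,
                     counts / np.maximum(n_sa[..., None], 1), 1.0 / S)
    # Hoeffding-style statistical radius; unvisited pairs are fully pessimistic.
    stat_radius = np.where(n_sa > 0, 1.0 / np.sqrt(np.maximum(n_sa, 1)), 1.0)
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        for s in range(S):
            for a in range(A):
                radius = min(1.0, rho + stat_radius[s, a])    # "double" pessimism
                Q[s, a] = rewards[s, a] + gamma * worst_case_expectation(
                    p_hat[s, a], V, radius)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V                                # greedy robust policy
```

Adding the two radii is only a convenient approximation for this sketch; the framework described above keeps the two sources of pessimism separate, first forming a confidence set of nominal models and then taking the worst case over the perturbation set around each of them.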
Related papers
- Model-Free Robust $\phi$-Divergence Reinforcement Learning Using Both Offline and Online Data [16.995406965407003]
We propose a model-free algorithm called Robust $\phi$-regularized fitted Q-iteration (RPQ) for learning an $\epsilon$-optimal robust policy.
We also introduce the hybrid robust $\phi$-regularized reinforcement learning framework to learn an optimal robust policy using both historical data and online sampling.
arXiv Detail & Related papers (2024-05-08T23:52:37Z) - Towards Robust Model-Based Reinforcement Learning Against Adversarial Corruption [60.958746600254884]
This study tackles the challenges of adversarial corruption in model-based reinforcement learning (RL).
We introduce an algorithm called corruption-robust optimistic MLE (CR-OMLE), which leverages total-variation (TV)-based information ratios as uncertainty weights for MLE.
We extend our weighting technique to the offline setting, and propose an algorithm named corruption-robust pessimistic MLE (CR-PMLE).
arXiv Detail & Related papers (2024-02-14T07:27:30Z) - Online non-parametric likelihood-ratio estimation by Pearson-divergence
functional minimization [55.98760097296213]
We introduce a new framework for online non-parametric LRE (OLRE) for the setting where pairs of i.i.d. observations $(x_t \sim p, x'_t \sim q)$ are observed over time.
We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.
arXiv Detail & Related papers (2023-11-03T13:20:11Z) - Importance Weighted Actor-Critic for Optimal Conservative Offline
Reinforcement Learning [23.222448307481073]
We propose a new practical algorithm for offline reinforcement learning (RL) in complex environments with insufficient data coverage.
Our algorithm combines the marginalized importance sampling framework with the actor-critic paradigm.
We provide both theoretical analysis and experimental results to validate the effectiveness of our proposed algorithm.
arXiv Detail & Related papers (2023-01-30T07:53:53Z) - Online Policy Optimization for Robust MDP [17.995448897675068]
Reinforcement learning (RL) has exceeded human performance in many synthetic settings such as video games and Go.
In this work, we consider online robust Markov decision process (MDP) by interacting with an unknown nominal system.
We propose a robust optimistic policy optimization algorithm that is provably efficient.
arXiv Detail & Related papers (2022-09-28T05:18:20Z) - Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z) - Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium
Learning from Offline Datasets [101.5329678997916]
We study episodic two-player zero-sum Markov games (MGs) in the offline setting.
The goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori.
arXiv Detail & Related papers (2022-02-15T15:39:30Z) - Sample Complexity of Robust Reinforcement Learning with a Generative
Model [0.0]
We propose a model-based reinforcement learning (RL) algorithm for learning an $\epsilon$-optimal robust policy.
We consider three different forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence; a minimal sketch of the KL case appears after this list.
In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies.
arXiv Detail & Related papers (2021-12-02T18:55:51Z) - Modeling the Second Player in Distributionally Robust Optimization [90.25995710696425]
We argue for the use of neural generative models to characterize the worst-case distribution.
This approach poses a number of implementation and optimization challenges.
We find that the proposed approach yields models that are more robust than comparable baselines.
arXiv Detail & Related papers (2021-03-18T14:26:26Z) - COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well or better as compared to prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
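As promised after the generative-model entry above, here is a minimal, self-contained sketch (ours, not code from any of the papers listed) for the KL-divergence uncertainty set: the worst-case next-state expectation used inside a robust Bellman backup admits the well-known scalar dual $\min_{\mathrm{KL}(q\|p)\le\delta} \mathbb{E}_q[v] = \sup_{\lambda>0} \{-\lambda \log \mathbb{E}_p[e^{-v/\lambda}] - \lambda\delta\}$, so it reduces to a one-dimensional optimization.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_robust_expectation(p, v, delta):
    """Worst-case expectation of v over all q with KL(q || p) <= delta,
    evaluated via the scalar dual (maximized over the multiplier lam)."""
    p, v = np.asarray(p, float), np.asarray(v, float)

    def neg_dual(lam):
        z = -v / lam
        m = z.max()                                   # log-sum-exp stabilization
        log_mgf = m + np.log(np.sum(p * np.exp(z - m)))
        return -(-lam * log_mgf - lam * delta)        # negate: we minimize

    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e3), method="bounded")
    return -res.fun

# Tiny usage example: the robust value sits below the nominal expectation p @ v.
p = np.array([0.5, 0.3, 0.2])
v = np.array([1.0, 0.0, 2.0])
print(p @ v, kl_robust_expectation(p, v, delta=0.1))
```

This one-dimensional dual structure is what makes divergence-based robust backups cheap enough to plug into otherwise standard value iteration or fitted Q-iteration loops.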
This list is automatically generated from the titles and abstracts of the papers indexed on this site.