State-Aware Proximal Pessimistic Algorithms for Offline Reinforcement
Learning
- URL: http://arxiv.org/abs/2211.15065v1
- Date: Mon, 28 Nov 2022 04:56:40 GMT
- Title: State-Aware Proximal Pessimistic Algorithms for Offline Reinforcement
Learning
- Authors: Chen Chen, Hongyao Tang, Yi Ma, Chao Wang, Qianli Shen, Dong Li,
Jianye Hao
- Abstract summary: Pessimism is of great importance in offline reinforcement learning (RL).
We propose a principled algorithmic framework for offline RL, called State-Aware Proximal Pessimism (SA-PP).
- Score: 36.34691755377286
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Pessimism is of great importance in offline reinforcement learning (RL). One
broad category of offline RL algorithms fulfills pessimism by explicit or
implicit behavior regularization. However, most of them only consider policy
divergence as behavior regularization, ignoring the effect of how the offline
state distribution differs from that of the learning policy, which may lead to
under-pessimism for some states and over-pessimism for others. To address this
problem, we propose a principled algorithmic framework for offline RL,
called \emph{State-Aware Proximal Pessimism} (SA-PP). The key idea of SA-PP is
leveraging discounted stationary state distribution ratios between the learning
policy and the offline dataset to modulate the degree of behavior
regularization in a state-wise manner, so that pessimism can be implemented in
a more appropriate way. We first provide theoretical justification for the
superiority of SA-PP over previous algorithms, demonstrating that SA-PP
produces a lower suboptimality upper bound in a broad range of settings.
Furthermore, we propose a new algorithm named \emph{State-Aware Conservative
Q-Learning} (SA-CQL), by building SA-PP upon the representative CQL algorithm with
the help of DualDICE for estimating discounted stationary state distribution
ratios. Extensive experiments on the standard offline RL benchmark show that
SA-CQL outperforms the popular baselines on a large portion of the benchmark
tasks and attains the highest average return.
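As a rough, hedged illustration of the idea in the abstract, the sketch below shows one way a state-wise weight derived from an estimated discounted stationary state distribution ratio d^pi(s)/d^D(s) could modulate a CQL-style conservative penalty. Everything named here (sa_cql_loss, q_net, target_q_net, ratio_net, alpha, and the direction of the modulation) is an illustrative assumption, not the authors' released SA-CQL code.

```python
# Minimal sketch (PyTorch) of a state-aware conservative Q-learning update.
# Assumptions, not the paper's code: q_net / target_q_net map states to per-action
# Q-values, ratio_net is a DualDICE-style estimator of the discounted stationary
# state distribution ratio d^pi(s) / d^D(s), and alpha trades off conservatism.
import torch
import torch.nn.functional as F


def sa_cql_loss(q_net, target_q_net, ratio_net, batch, gamma=0.99, alpha=1.0):
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    # Standard TD term on logged transitions.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(dim=1).values
    td_loss = F.mse_loss(q_sa, target)

    # CQL-style conservative gap: logsumexp over all actions minus the dataset action.
    conservative_gap = torch.logsumexp(q_net(s), dim=1) - q_sa

    # State-aware modulation: weight the penalty per state by the estimated ratio.
    # How the ratio is mapped to a weight (and in which direction) is an assumption here.
    with torch.no_grad():
        state_weight = ratio_net(s).squeeze(-1).clamp(min=0.0)

    return td_loss + alpha * (state_weight * conservative_gap).mean()
```

The only state-aware ingredient is the per-state weight on the conservative gap; with state_weight fixed to 1 this reduces to a plain CQL-style loss.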
Related papers
- Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning [57.10922880400715]
In this paper, we propose a new algorithm, based on conservative policy iteration, that substantially enhances behavior regularization.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- STEEL: Singularity-aware Reinforcement Learning [14.424199399139804]
Batch reinforcement learning (RL) aims at leveraging pre-collected data to find an optimal policy.
We propose a new batch RL algorithm that allows for singularity for both state and action spaces.
By leveraging the idea of pessimism and under some technical conditions, we derive a first finite-sample regret guarantee for our proposed algorithm.
arXiv Detail & Related papers (2023-01-30T18:29:35Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
- Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark; a rough sketch of such a one-step scheme appears after this list.
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
- Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important to apply RL algorithms to many high stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation back-ups, taking a more conservative update, can have much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z)
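For contrast with the iterative methods above, the "Offline RL Without Off-Policy Evaluation" entry describes a one-step scheme: evaluate the behavior policy's Q-function on the dataset, then perform a single constrained/regularized improvement step. The tabular sketch below is only a rough caricature under assumed interfaces (fit_q_behavior, one_step_policy, the mixture-style regularization), not that paper's implementation.

```python
# Rough tabular caricature of one-step offline RL (assumed names and interfaces).
import numpy as np


def fit_q_behavior(dataset, n_states, n_actions, gamma=0.99, lr=0.1, passes=200):
    """SARSA-style evaluation of the behavior policy from logged tuples
    (s, a, r, s_next, a_next, done); no off-policy evaluation is performed."""
    q = np.zeros((n_states, n_actions))
    for _ in range(passes):
        for s, a, r, s_next, a_next, done in dataset:
            target = r + gamma * (1.0 - done) * q[s_next, a_next]
            q[s, a] += lr * (target - q[s, a])
    return q


def one_step_policy(q_behavior, behavior_counts, beta=1.0, mix=0.5):
    """Single improvement step: softmax over the behavior Q-values, mixed with the
    empirical behavior policy as a simple stand-in for a regularized update."""
    behavior_pi = behavior_counts / behavior_counts.sum(axis=1, keepdims=True)
    greedy_pi = np.exp(beta * (q_behavior - q_behavior.max(axis=1, keepdims=True)))
    greedy_pi /= greedy_pi.sum(axis=1, keepdims=True)
    return (1.0 - mix) * greedy_pi + mix * behavior_pi
```

With mix close to 1 the result stays near the behavior policy; with mix near 0 it approaches a greedy softmax policy over the behavior Q-estimate.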