Offline RL Without Off-Policy Evaluation
- URL: http://arxiv.org/abs/2106.08909v1
- Date: Wed, 16 Jun 2021 16:04:26 GMT
- Title: Offline RL Without Off-Policy Evaluation
- Authors: David Brandfonbrener, William F. Whitney, Rajesh Ranganath, Joan Bruna
- Abstract summary: We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
- Score: 49.11859771578969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most prior approaches to offline reinforcement learning (RL) have taken an
iterative actor-critic approach involving off-policy evaluation. In this paper
we show that simply doing one step of constrained/regularized policy
improvement using an on-policy Q estimate of the behavior policy performs
surprisingly well. This one-step algorithm beats the previously reported
results of iterative algorithms on a large portion of the D4RL benchmark. The
simple one-step baseline achieves this strong performance without many of the
tricks used by previously proposed iterative algorithms and is more robust to
hyperparameters. We argue that the relatively poor performance of iterative
approaches is a result of the high variance inherent in doing off-policy
evaluation and magnified by the repeated optimization of policies against those
high-variance estimates. In addition, we hypothesize that the strong
performance of the one-step algorithm is due to a combination of favorable
structure in the environment and behavior policy.
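To make the recipe concrete, the following is a minimal PyTorch-style sketch of a one-step pipeline: fit a behavior-policy estimate by maximum likelihood, evaluate it with on-policy (SARSA-style) targets, then take a single regularized policy-improvement step. The Gaussian policy, MLP critic, reverse-KL regularizer, hyperparameters, and `dataset` interface are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the one-step recipe (illustrative, not the authors' code).
# Stage 1: behavior cloning -> beta_hat
# Stage 2: on-policy (SARSA-style) evaluation of Q^beta on the fixed dataset
# Stage 3: a single reverse-KL-regularized policy-improvement step
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))


class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = mlp(obs_dim, 2 * act_dim)

    def dist(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5.0, 2.0).exp())


def one_step_offline_rl(dataset, obs_dim, act_dim,
                        gamma=0.99, alpha=0.1, steps=10_000):
    """`dataset(n)` is assumed to yield n batches of tensors
    (obs, act, rew, next_obs, next_act, done) from the offline data,
    with rew and done shaped (batch, 1)."""
    beta = GaussianPolicy(obs_dim, act_dim)   # behavior-policy estimate
    q = mlp(obs_dim + act_dim, 1)             # Q^beta critic
    pi = GaussianPolicy(obs_dim, act_dim)     # policy after one improvement step
    opt_b = torch.optim.Adam(beta.parameters(), lr=3e-4)
    opt_q = torch.optim.Adam(q.parameters(), lr=3e-4)
    opt_pi = torch.optim.Adam(pi.parameters(), lr=3e-4)

    # Stage 1: maximum-likelihood behavior cloning.
    for obs, act, *_ in dataset(steps):
        bc_loss = -beta.dist(obs).log_prob(act).sum(-1).mean()
        opt_b.zero_grad(); bc_loss.backward(); opt_b.step()

    # Stage 2: evaluate the behavior policy with SARSA-style targets.
    # The target uses the logged next action, so no off-policy evaluation occurs.
    for obs, act, rew, next_obs, next_act, done in dataset(steps):
        with torch.no_grad():
            target = rew + gamma * (1.0 - done) * q(torch.cat([next_obs, next_act], -1))
        td_loss = F.mse_loss(q(torch.cat([obs, act], -1)), target)
        opt_q.zero_grad(); td_loss.backward(); opt_q.step()

    # Stage 3: one regularized improvement step against the fixed Q^beta:
    # maximize E[Q^beta(s, a)] - alpha * KL(pi || beta_hat).
    for obs, *_ in dataset(steps):
        pi_dist = pi.dist(obs)
        a = pi_dist.rsample()
        kl = (pi_dist.log_prob(a) - beta.dist(obs).log_prob(a)).sum(-1)
        pi_loss = (alpha * kl - q(torch.cat([obs, a], -1)).squeeze(-1)).mean()
        opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()

    return pi
```

The reverse-KL penalty is only one possible way to constrain the improvement step toward the behavior estimate; the key point from the abstract is that Q is only ever trained to evaluate the behavior policy, and the improvement step is performed once rather than iterated.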
Related papers
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- You Only Evaluate Once: a Simple Baseline Algorithm for Offline RL [29.98260009732724]
We propose a baseline algorithm for offline reinforcement learning that only performs the policy evaluation step once.
We empirically find that the proposed algorithm exhibits competitive and sometimes even state-of-the-art performance in a subset of the D4RL offline RL benchmark.
arXiv Detail & Related papers (2021-10-05T19:05:47Z)
- Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories with lower variance in the return; a minimal sketch of such a mean-variance criterion is given after this list.
arXiv Detail & Related papers (2021-02-03T10:06:16Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, together with off-policy learning and function approximation, gives rise to the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Such algorithms are memory-efficient and can learn from samples collected by other policies.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Proximal Deterministic Policy Gradient [20.951797549505986]
We introduce two techniques to improve off-policy Reinforcement Learning (RL) algorithms.
We exploit the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action value estimate.
We demonstrate significant performance improvement over state-of-the-art algorithms on standard continuous-control RL benchmarks.
arXiv Detail & Related papers (2020-08-03T10:19:59Z)
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations" and show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z)
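As a companion to the "Variance Penalized On-Policy and Off-Policy Actor-Critic" entry above, the snippet below is a minimal sketch of a generic mean-variance performance criterion, J(theta) = E[G] - lam * Var[G], estimated with Monte Carlo returns and score-function gradients. It illustrates the general idea of penalizing return variance, not that paper's specific actor-critic construction; the helper name and `lam` coefficient are hypothetical.

```python
# Illustrative mean-variance policy-gradient surrogate (not the paper's algorithm):
# maximize J(theta) = E[G] - lam * Var[G] with REINFORCE-style estimates.
import torch


def mean_variance_pg_loss(log_probs, returns, lam=0.1):
    """log_probs: (N,) sum of log pi(a_t | s_t) over each of N episodes (with grad).
    returns:   (N,) Monte Carlo return G of each episode (treated as constants)."""
    G = returns.detach()
    mean_G = G.mean()
    # grad E[G]   ~ E[G * grad log pi]
    # grad Var[G] = grad E[G^2] - grad (E[G])^2
    #             ~ E[G^2 * grad log pi] - 2 * E[G] * E[G * grad log pi]
    surrogate_mean = (G * log_probs).mean()
    surrogate_var = (G.pow(2) * log_probs).mean() - 2.0 * mean_G * (G * log_probs).mean()
    return -(surrogate_mean - lam * surrogate_var)  # negated for gradient descent
```

Differentiating this surrogate and taking an optimizer step trades off expected return against return variance, with `lam` controlling the strength of the variance penalty.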
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.