You Only Evaluate Once: a Simple Baseline Algorithm for Offline RL
- URL: http://arxiv.org/abs/2110.02304v1
- Date: Tue, 5 Oct 2021 19:05:47 GMT
- Title: You Only Evaluate Once: a Simple Baseline Algorithm for Offline RL
- Authors: Wonjoon Goo, Scott Niekum
- Abstract summary: We propose a baseline algorithm for offline reinforcement learning that only performs the policy evaluation step once.
We empirically find that the proposed algorithm exhibits competitive and sometimes even state-of-the-art performance in a subset of the D4RL offline RL benchmark.
- Score: 29.98260009732724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of offline reinforcement learning (RL) is to find an optimal policy
given prerecorded trajectories. Many current approaches customize existing
off-policy RL algorithms, especially actor-critic algorithms in which policy
evaluation and improvement are iterated. However, the convergence of such
approaches is not guaranteed due to the use of complex non-linear function
approximation and an intertwined optimization process. By contrast, we propose
a simple baseline algorithm for offline RL that only performs the policy
evaluation step once so that the algorithm does not require complex
stabilization schemes. Since the proposed algorithm is not likely to converge
to an optimal policy, it is an appropriate baseline for actor-critic algorithms
that ought to be outperformed if there is indeed value in iterative
optimization in the offline setting. Surprisingly, we empirically find that the
proposed algorithm exhibits competitive and sometimes even state-of-the-art
performance in a subset of the D4RL offline RL benchmark. This result suggests
that future work is needed to fully exploit the potential advantages of
iterative optimization in order to justify the reduced stability of such
methods.
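To make the abstract's recipe concrete, below is a minimal, purely illustrative Python sketch of an "evaluate once, improve once" baseline: the behavior policy's Q-function is fitted a single time from the logged transitions with SARSA-style backups, and a policy is then read off greedily over the actions that actually appear in the dataset. The names (OfflineDataset, evaluate_once, extract_policy), the tabular representation, and the toy data are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: an "evaluate once, improve once" baseline in the
# spirit of the abstract. OfflineDataset, evaluate_once and extract_policy
# are hypothetical names, not the authors' code.
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Hashable, List, Tuple

# (state, action, reward, next_state, next_action, done)
Transition = Tuple[Hashable, Hashable, float, Hashable, Hashable, bool]

@dataclass
class OfflineDataset:
    transitions: List[Transition]

def evaluate_once(data: OfflineDataset, gamma: float = 0.99,
                  sweeps: int = 200, lr: float = 0.5
                  ) -> Dict[Tuple[Hashable, Hashable], float]:
    """Single policy-evaluation phase: fit Q of the behavior policy with
    SARSA-style backups over the fixed dataset (no policy improvement here)."""
    q: Dict[Tuple[Hashable, Hashable], float] = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s2, a2, done in data.transitions:
            target = r + (0.0 if done else gamma * q[(s2, a2)])
            q[(s, a)] += lr * (target - q[(s, a)])
    return dict(q)

def extract_policy(q: Dict[Tuple[Hashable, Hashable], float]
                   ) -> Dict[Hashable, Hashable]:
    """One improvement step: per state, keep the best action seen in the data.
    The resulting policy is never re-evaluated."""
    best: Dict[Hashable, Tuple[Hashable, float]] = {}
    for (s, a), v in q.items():
        if s not in best or v > best[s][1]:
            best[s] = (a, v)
    return {s: a for s, (a, _) in best.items()}

if __name__ == "__main__":
    # Toy two-state dataset just to exercise the functions.
    data = OfflineDataset(transitions=[
        ("s0", "left", 0.0, "s1", "stay", False),
        ("s1", "stay", 1.0, "s1", "stay", True),
        ("s0", "right", 0.5, "s1", "stay", False),
    ])
    q_beta = evaluate_once(data)
    print(extract_policy(q_beta))  # -> {'s0': 'right', 's1': 'stay'}
```

Because the extracted policy is never re-evaluated, this baseline is not expected to reach the optimal policy, which is exactly why the paper positions it as the bar that iterative actor-critic methods should clear.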
Related papers
- Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior regularization, building on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement (see the sketch after this list).
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z) - Offline Policy Optimization in RL with Variance Regularizaton [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues when computing the gradient of the variance regularizer (see the dual-form identity after this list).
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Proximal Point Imitation Learning [48.50107891696562]
We develop new algorithms with rigorous efficiency guarantees for infinite horizon imitation learning.
We leverage classical tools from optimization, in particular, the proximal-point method (PPM) and dual smoothing.
We achieve convincing empirical performance for both linear and neural network function approximation.
arXiv Detail & Related papers (2022-09-22T12:40:21Z)
- A Policy Efficient Reduction Approach to Convex Constrained Deep Reinforcement Learning [2.811714058940267]
We propose a new variant of the conditional gradient (CG) type algorithm, which generalizes the minimum norm point (MNP) method.
Our method reduces the memory costs by an order of magnitude, and achieves better performance, demonstrating both its effectiveness and efficiency.
arXiv Detail & Related papers (2021-08-29T20:51:32Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well (see the one-step objective after this list).
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
- Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization [71.03797261151605]
Adaptivity is an important yet under-studied property in modern optimization theory.
Our algorithm is proven to achieve the best available convergence rate for non-PL (Polyak-Lojasiewicz) objectives while simultaneously outperforming existing algorithms for PL objectives.
arXiv Detail & Related papers (2020-02-13T05:42:27Z)
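For the "Iteratively Refined Behavior Regularization" entry above, the LaTeX sketch below illustrates the kind of update the summary describes: a behavior-regularized improvement step whose reference policy is refreshed with the latest iterate, so the regularizer tracks the current policy rather than the original behavior policy. The symbols (pi_k, alpha, the divergence D) are generic placeholders, not the paper's exact formulation.

```latex
% Hedged sketch: behavior-regularized improvement with an iteratively
% refined reference policy. \pi_0 is the behavior (data-collection) policy;
% after each step the reference is replaced by the latest iterate \pi_k.
\[
\pi_{k+1} \;=\; \arg\max_{\pi}\;
  \mathbb{E}_{s \sim \mathcal{D}}\!\Big[
    \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ Q^{\pi_k}(s,a) \big]
    \;-\; \alpha\, D\big( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \big)
  \Big],
\qquad k = 0, 1, 2, \ldots
\]
```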
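The variance-regularization entry mentions using Fenchel duality to avoid double sampling when differentiating the variance term. A standard identity that accomplishes this, stated here only to illustrate the summary's claim (the paper's exact objective may differ), rewrites the squared expectation with an auxiliary dual variable so that a single sample suffices for an unbiased gradient:

```latex
% The variance regularizer involves Var(X) = E[X^2] - (E[X])^2; the squared
% expectation needs two independent samples for an unbiased gradient
% ("double sampling"). The Fenchel conjugate of the square,
% y^2 = max_nu (2*nu*y - nu^2), replaces it with a term linear in E[X]:
\[
\big(\mathbb{E}[X]\big)^2
  \;=\; \max_{\nu \in \mathbb{R}} \big( 2\nu\,\mathbb{E}[X] - \nu^2 \big),
\qquad
\mathrm{Var}(X)
  \;=\; \mathbb{E}[X^2] \;-\; \max_{\nu \in \mathbb{R}} \big( 2\nu\,\mathbb{E}[X] - \nu^2 \big).
\]
```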
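The "Offline RL Without Off-Policy Evaluation" entry describes one step of regularized improvement against an on-policy estimate of the behavior policy's Q-function, which is closely related to the single-evaluation baseline sketched after the abstract. As a purely illustrative objective (generic notation, not the paper's):

```latex
% One-step recipe: (1) estimate \hat{Q}^{\beta} of the behavior policy \beta
% from the logged data (an on-policy evaluation, e.g. SARSA-style);
% (2) take a single regularized improvement step and stop.
\[
\hat{\pi} \;=\; \arg\max_{\pi}\;
  \mathbb{E}_{s \sim \mathcal{D}}\!\Big[
    \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ \hat{Q}^{\beta}(s,a) \big]
    \;-\; \alpha\, \mathrm{KL}\big( \pi(\cdot \mid s) \,\|\, \beta(\cdot \mid s) \big)
  \Big]
\]
```

This shares the evaluate-once structure of the main paper's baseline; the summary above reports that such a one-step procedure already beats iterative algorithms on a large portion of D4RL.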
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.