A Connection between One-Step Regularization and Critic Regularization
in Reinforcement Learning
- URL: http://arxiv.org/abs/2307.12968v1
- Date: Mon, 24 Jul 2023 17:46:32 GMT
- Title: A Connection between One-Step Regularization and Critic Regularization
in Reinforcement Learning
- Authors: Benjamin Eysenbach, Matthieu Geist, Sergey Levine, Ruslan
Salakhutdinov
- Abstract summary: One-step methods perform regularization by doing just a single step of policy improvement.
Critic regularization methods do many steps of policy improvement with a regularized objective.
Applying a multi-step critic regularization method with a regularization coefficient of 1 yields the same policy as one-step RL.
- Score: 163.44116192806922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As with any machine learning problem with limited data, effective offline RL
algorithms require careful regularization to avoid overfitting. One-step
methods perform regularization by doing just a single step of policy
improvement, while critic regularization methods do many steps of policy
improvement with a regularized objective. These methods appear distinct.
One-step methods, such as advantage-weighted regression and conditional
behavioral cloning, truncate policy iteration after just one step. This ``early
stopping'' makes one-step RL simple and stable, but can limit its asymptotic
performance. Critic regularization typically requires more compute but has
appealing lower-bound guarantees. In this paper, we draw a close connection
between these methods: applying a multi-step critic regularization method with
a regularization coefficient of 1 yields the same policy as one-step RL. While
practical implementations violate our assumptions and critic regularization is
typically applied with smaller regularization coefficients, our experiments
nevertheless show that our analysis makes accurate, testable predictions about
practical offline RL methods (CQL and one-step RL) with commonly-used
hyperparameters. Our results do not imply that every problem can be solved with a single
step of policy improvement, but rather that one-step RL might be competitive
with critic regularization on RL problems that demand strong regularization.
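The contrast drawn in the abstract can be made concrete with a small sketch. The snippet below (Python/NumPy) places the two objectives side by side: one-step RL extracts a policy from a fixed estimate of the behavior policy's Q-function in a single exponentiated-advantage step, while critic regularization repeatedly trains a critic whose loss adds a CQL-style penalty scaled by a coefficient alpha. The toy data, the temperature, the function names, and the exact penalty form are illustrative assumptions; the sketch shows the shape of the two objectives, not the paper's proof that alpha = 1 recovers the one-step policy under its assumptions.

```python
# Minimal, hedged sketch of the two regularization routes contrasted in the
# abstract. The tiny tabular "dataset", the temperature, and the CQL-style
# penalty below are illustrative assumptions, not the paper's exact setup.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3

# Empirical behavior policy beta(a|s) estimated from dataset action counts.
counts = rng.integers(1, 10, size=(n_states, n_actions)).astype(float)
beta = counts / counts.sum(axis=1, keepdims=True)

# Stand-in for a SARSA-style estimate of Q^beta fit on the offline dataset.
q_beta = rng.normal(size=(n_states, n_actions))

# --- One-step RL: a single regularized improvement step against Q^beta ---
# Exponentiated-advantage (AWR-like) policy extraction; one step, then stop.
def one_step_policy(q, behavior, temperature=1.0):
    v = (behavior * q).sum(axis=1, keepdims=True)       # V^beta(s)
    pi = behavior * np.exp((q - v) / temperature)       # beta * exp(A / T)
    return pi / pi.sum(axis=1, keepdims=True)

# --- Critic regularization: many improvement steps, regularized critic ---
# CQL-flavoured critic loss: Bellman error plus alpha * (E_pi[Q] - E_beta[Q]).
# The paper's analysis concerns the policy such an iterated procedure settles
# on when alpha = 1 under its assumptions; this function only shows the shape
# of the objective, not the fixed point.
def regularized_critic_loss(q, pi, behavior, bellman_targets, alpha=1.0):
    bellman_error = ((q - bellman_targets) ** 2).mean()
    penalty = ((pi * q).sum(axis=1) - (behavior * q).sum(axis=1)).mean()
    return bellman_error + alpha * penalty

pi_one_step = one_step_policy(q_beta, beta)
loss = regularized_critic_loss(q_beta, pi_one_step, beta,
                               bellman_targets=q_beta, alpha=1.0)
print(pi_one_step.round(3))
print(float(loss))
```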
Related papers
- Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with
Expert Guidance [74.31779732754697]
We propose a novel plug-in approach named Guided Offline RL (GORL).
GORL employs a guiding network, along with only a few expert demonstrations, to adaptively determine the relative importance of the policy improvement and the policy constraint for every sample (a generic sketch of such per-sample weighting appears after this list).
Experiments on various environments suggest that GORL can be easily installed on most offline RL algorithms with statistically significant performance improvements.
arXiv Detail & Related papers (2023-09-04T08:59:04Z) - Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z) - ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent for
Last-Iterate Convergence in Constrained MDPs [31.663072540757643]
Reinforcement Learning (RL) has been applied to real-world problems with increasing success.
We introduce Reinforcement Learning with Optimistic Ascent-Descent (ReLOAD).
arXiv Detail & Related papers (2023-02-02T18:05:27Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer (a standard dual identity of this kind is sketched after this list).
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - Optimal Conservative Offline RL with General Function Approximation via
Augmented Lagrangian [18.2080757218886]
Offline reinforcement learning (RL) refers to decision-making from a previously collected dataset of interactions.
We present the first set of offline RL algorithms that are statistically optimal and practical under general function approximation and single-policy concentrability.
arXiv Detail & Related papers (2022-11-01T19:28:48Z) - Instance-Dependent Confidence and Early Stopping for Reinforcement
Learning [99.57168572237421]
Various algorithms for reinforcement learning (RL) exhibit dramatic variation in their convergence rates as a function of problem structure.
This research provides guarantees that explain ex post the performance differences observed.
A natural next step is to convert these theoretical guarantees into guidelines that are useful in practice.
arXiv Detail & Related papers (2022-01-21T04:25:35Z) - A Policy Efficient Reduction Approach to Convex Constrained Deep
Reinforcement Learning [2.811714058940267]
We propose a new variant of the conditional gradient (CG) type algorithm, which generalizes the minimum norm point (MNP) method.
Our method reduces the memory costs by an order of magnitude, and achieves better performance, demonstrating both its effectiveness and efficiency.
arXiv Detail & Related papers (2021-08-29T20:51:32Z) - Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
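The GORL entry above describes adaptively weighting, per sample, a policy-improvement term against a behavior-constraint term. The snippet below is only a generic illustration of that trade-off: the placeholder_weights function, the loss shapes, and the batch data are assumptions, not GORL's actual guiding-network rule.

```python
# Hedged illustration of a per-sample trade-off between a policy-improvement
# term and a behavior-constraint (cloning) term, as described for GORL above.
# The weights come from a placeholder function, NOT GORL's guiding network.
import numpy as np

rng = np.random.default_rng(1)
batch = 5

improvement_term = -rng.normal(size=batch)   # e.g. -Q(s, pi(s)) per sample
constraint_term = rng.random(size=batch)     # e.g. a BC / divergence penalty

def placeholder_weights(batch_size):
    # Stand-in for a per-sample guiding score; uniform 0.5 here.
    return np.full(batch_size, 0.5)

w = placeholder_weights(batch)
per_sample_loss = w * improvement_term + (1.0 - w) * constraint_term
print(per_sample_loss.mean())
```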
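The OVAR entry above notes that Fenchel duality removes the double-sampling issue in the gradient of a variance regularizer. One standard identity of this flavour (an illustration only, not necessarily the construction used in that paper) rewrites the squared expectation through its Fenchel conjugate, turning the variance into a single expectation minimized over an auxiliary scalar:

```latex
% A standard dual form of the variance (illustrative; not necessarily the
% exact construction used in the OVAR paper).
\begin{align}
(\mathbb{E}[X])^2 &= \max_{\nu \in \mathbb{R}} \big( 2\nu\,\mathbb{E}[X] - \nu^2 \big), \\
\operatorname{Var}[X] &= \mathbb{E}[X^2] - (\mathbb{E}[X])^2
  = \min_{\nu \in \mathbb{R}} \mathbb{E}\big[(X - \nu)^2\big].
\end{align}
```

Because the inner objective is a single expectation of a squared term, an unbiased stochastic gradient needs only one sample of X per update, whereas differentiating the squared expectation directly would require two independent samples.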
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.