Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2108.08812v1
- Date: Thu, 19 Aug 2021 17:27:29 GMT
- Title: Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning
- Authors: Andrea Zanette, Martin J. Wainwright, Emma Brunskill
- Abstract summary: Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well-understood theoretically.
We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle.
- Score: 85.50033812217254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Actor-critic methods are widely used in offline reinforcement learning
practice, but are not so well-understood theoretically. We propose a new
offline actor-critic algorithm that naturally incorporates the pessimism
principle, leading to several key advantages compared to the state of the art.
The algorithm can operate when the Bellman evaluation operator is closed with
respect to the action value function of the actor's policies; this is a more
general setting than the low-rank MDP model. Despite the added generality, the
procedure is computationally tractable as it involves the solution of a
sequence of second-order programs. We prove an upper bound on the suboptimality
gap of the policy returned by the procedure that depends on the data coverage
of an arbitrary, possibly data-dependent comparator policy. The achievable
guarantee is complemented with a minimax lower bound that is matching up to
logarithmic factors.
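As an illustration only, and not the authors' actual procedure (which solves a sequence of second-order programs under the Bellman-closedness condition above), the following minimal Python sketch shows the general shape of a pessimistic offline actor-critic loop: the critic is evaluated on the logged data, a coverage-based penalty is subtracted from the value estimates, and the actor is improved against the pessimistic critic. The tabular setting, function names, and constants are assumptions made for this sketch.

import numpy as np

def pessimistic_actor_critic(dataset, n_states, n_actions, gamma=0.99,
                             beta=1.0, iters=50, alpha=0.1):
    # Illustrative sketch, not the paper's algorithm.
    # Count how often each (state, action) pair appears in the logged data.
    counts = np.zeros((n_states, n_actions))
    for s, a, r, s2 in dataset:
        counts[s, a] += 1.0
    # Coverage-based penalty: large where the data is scarce (assumed form).
    bonus = beta / np.sqrt(np.maximum(counts, 1.0))

    policy = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform actor
    q = np.zeros((n_states, n_actions))

    for _ in range(iters):
        # Critic step: approximate evaluation of the current actor on the data.
        for s, a, r, s2 in dataset:
            target = r + gamma * np.dot(policy[s2], q[s2])
            q[s, a] += alpha * (target - q[s, a])
        # Pessimism: penalize poorly covered state-action pairs.
        q_pess = q - bonus
        # Actor step: soft (exponentiated) improvement against the pessimistic critic.
        logits = np.log(policy + 1e-8) + alpha * q_pess
        policy = np.exp(logits - logits.max(axis=1, keepdims=True))
        policy /= policy.sum(axis=1, keepdims=True)
    return policy, q_pess

Here dataset would be a list of (state, action, reward, next_state) integer tuples; the bonus term stands in for the pessimism principle, shrinking value estimates wherever the data coverage is weak.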
Related papers
- Optimal Strong Regret and Violation in Constrained MDPs via Policy Optimization [37.24692425018]
We study online learning in constrained MDPs (CMDPs).
Our algorithm implements a primal-dual scheme that employs a state-of-the-art policy optimization approach for adversarial MDPs.
arXiv Detail & Related papers (2024-10-03T07:54:04Z)
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- Decision-Aware Actor-Critic with Function Approximation and Theoretical Guarantees [12.259191000019033]
Actor-critic (AC) methods are widely used in reinforcement learning (RL).
We design a joint objective for training the actor and critic in a decision-aware fashion.
We empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.
arXiv Detail & Related papers (2023-05-24T15:34:21Z)
- Importance Weighted Actor-Critic for Optimal Conservative Offline Reinforcement Learning [23.222448307481073]
We propose a new practical algorithm for offline reinforcement learning (RL) in complex environments with insufficient data coverage.
Our algorithm combines the marginalized importance sampling framework with the actor-critic paradigm.
We provide both theoretical analysis and experimental results to validate the effectiveness of our proposed algorithm.
arXiv Detail & Related papers (2023-01-30T07:53:53Z)
- Bellman Residual Orthogonalization for Offline Reinforcement Learning [53.17258888552998]
We introduce a new reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a test function space.
We exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class (a schematic form of this orthogonality condition is sketched after this list).
arXiv Detail & Related papers (2022-03-24T01:04:17Z)
- Zeroth-Order Actor-Critic [6.5158195776494]
We propose the Zeroth-Order Actor-Critic (ZOAC) algorithm, which unifies zeroth-order and first-order methods into an on-policy actor-critic architecture.
We evaluate our proposed method on a range of challenging continuous control benchmarks using different types of policies, where ZOAC outperforms zeroth-order and first-order baseline algorithms.
arXiv Detail & Related papers (2022-01-29T07:09:03Z)
- Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
- Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both the mean and the variance of the return (a common form of such a mean-variance criterion is sketched after this list).
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
arXiv Detail & Related papers (2021-02-03T10:06:16Z)
- Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER [6.9478331974594045]
We show convergence of the well-known Proximal Policy Optimization (PPO) and of the recently introduced RUDDER.
Our results are valid for actor-critic methods that use episodic samples and that have a policy that becomes more greedy during learning.
arXiv Detail & Related papers (2020-12-02T18:47:06Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
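For the Bellman Residual Orthogonalization entry above, the "validity only along a test function space" idea can be written schematically as an orthogonality condition of roughly the following form; the notation (data distribution \mu, test class \mathcal{F}) is illustrative and the paper's exact formulation may differ:

\mathbb{E}_{(s,a,r,s') \sim \mu}\!\left[ f(s,a)\,\Big( Q(s,a) - r - \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')} Q(s',a') \Big) \right] = 0 \quad \text{for all } f \in \mathcal{F},

i.e., the Bellman residual of Q under the policy \pi is required to be orthogonal to every test function in \mathcal{F}, rather than to vanish pointwise.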
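For the Variance Penalized On-Policy and Off-Policy Actor-Critic entry above, a common mean-variance criterion of the kind described (the paper's exact objective may differ) is

J_{\lambda}(\pi) = \mathbb{E}\big[G^{\pi}\big] - \lambda\, \mathrm{Var}\big[G^{\pi}\big], \qquad \lambda \ge 0,

where G^{\pi} denotes the return obtained under policy \pi and \lambda trades expected return against variability of the return.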
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.