Characterizing the Gap Between Actor-Critic and Policy Gradient
- URL: http://arxiv.org/abs/2106.06932v1
- Date: Sun, 13 Jun 2021 06:35:42 GMT
- Title: Characterizing the Gap Between Actor-Critic and Policy Gradient
- Authors: Junfeng Wen, Saurabh Kumar, Ramki Gummadi, Dale Schuurmans
- Abstract summary: We explain the gap between AC and PG methods by identifying the exact adjustment to the AC objective/gradient.
We develop practical algorithms, Residual Actor-Critic and Stackelberg Actor-Critic, for estimating the correction between AC and PG.
- Score: 47.77939973964009
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Actor-critic (AC) methods are ubiquitous in reinforcement learning. Although
it is understood that AC methods are closely related to policy gradient (PG),
their precise connection has not been fully characterized previously. In this
paper, we explain the gap between AC and PG methods by identifying the exact
adjustment to the AC objective/gradient that recovers the true policy gradient
of the cumulative reward objective (PG). Furthermore, by viewing the AC method
as a two-player Stackelberg game between the actor and critic, we show that the
Stackelberg policy gradient can be recovered as a special case of our more
general analysis. Based on these results, we develop practical algorithms,
Residual Actor-Critic and Stackelberg Actor-Critic, for estimating the
correction between AC and PG and use these to modify the standard AC algorithm.
Experiments on popular tabular and continuous environments show the proposed
corrections can improve both the sample efficiency and final performance of
existing AC methods.
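For orientation, here is a minimal sketch of the kind of gap the abstract describes, written in standard policy-gradient notation; it is not the paper's exact characterization. The policy gradient theorem gives

    \nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)\big],

while a typical AC update substitutes a learned critic Q_w for the true Q^{\pi_\theta},

    g_{AC}(\theta) = \mathbb{E}_{s \sim d,\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a|s)\, Q_w(s,a)\big].

If both expectations use the same state distribution, the gap reduces to a correction driven by the critic's residual error,

    \nabla_\theta J(\theta) - g_{AC}(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a|s)\,\big(Q^{\pi_\theta}(s,a) - Q_w(s,a)\big)\big],

and an additional term appears when the AC objective holds the sampling distribution d fixed even though d^{\pi_\theta} depends on \theta. The Stackelberg view mentioned in the abstract treats the actor as the leader of a bilevel problem, \max_\theta f(\theta, w^*(\theta)) with w^*(\theta) = \arg\min_w L(\theta, w), whose total derivative

    \frac{df}{d\theta} = \frac{\partial f}{\partial \theta} + \Big(\frac{d w^*}{d\theta}\Big)^{\top} \frac{\partial f}{\partial w}

adds a further correction on top of the partial gradient used by standard AC updates.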
Related papers
- On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning [50.856589224454055]
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs).
We propose regularized policy gradient (RPG), a framework for deriving and analyzing KL-regularized policy gradient methods in the online reinforcement learning setting.
RPG shows improved or competitive results in terms of training stability and performance compared to strong baselines such as GRPO, REINFORCE++, and DAPO.
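As a point of reference, a generic KL-regularized policy-gradient objective of the kind such frameworks analyze (not necessarily RPG's exact formulation) is

    J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\big[r(x, y)\big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\big[\mathrm{KL}\big(\pi_\theta(\cdot|x)\,\|\,\pi_{\mathrm{ref}}(\cdot|x)\big)\big],

where \pi_{\mathrm{ref}} is a fixed reference policy (e.g., the pre-RL model) and \beta controls how far the learned policy may drift from it; different estimators of the KL term and its gradient yield different members of this family.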
arXiv Detail & Related papers (2025-05-23T06:01:21Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Value Improved Actor Critic Algorithms [5.617360550806964]
We propose a general extension to the AC framework that employs two separate improvement operators.
We design two practical VI-AC algorithms based on the popular online off-policy AC algorithms TD3 and DDPG.
We evaluate VI-TD3 and VI-DDPG in the Mujoco benchmark and find that both improve upon or match the performance of their respective baselines in all environments tested.
arXiv Detail & Related papers (2024-06-03T15:24:15Z) - Actor-Critic Reinforcement Learning with Phased Actor [10.577516871906816]
We propose a novel phased actor in actor-critic (PAAC) method to improve policy gradient estimation.
PAAC accounts for both $Q$ value and TD error in its actor update.
Results show that PAAC leads to significant performance improvement measured by total cost, learning variance, robustness, learning speed and success rate.
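The blurb only states that the actor update uses both quantities; purely as an illustration (not the paper's actual rule), a phased or blended update of this kind can be written as

    \nabla_\theta J \approx \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a|s)\,\big((1-\lambda_t)\, Q_w(s,a) + \lambda_t\, \delta_t\big)\big], \qquad \delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t),

where \lambda_t is a hypothetical phase-dependent mixing coefficient; how PAAC actually schedules the two signals is specified in the paper itself.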
arXiv Detail & Related papers (2024-04-18T01:27:31Z) - Actor-Critic based Improper Reinforcement Learning [61.430513757337486]
We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process.
We propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic scheme and a Natural Actor-Critic scheme.
arXiv Detail & Related papers (2022-07-19T05:55:02Z) - Improving Covariance Conditioning of the SVD Meta-layer by Orthogonality [65.67315418971688]
Nearest Orthogonal Gradient (NOG) and Optimal Learning Rate (OLR) are proposed.
Experiments on visual recognition demonstrate that our methods can simultaneously improve the covariance conditioning and generalization.
arXiv Detail & Related papers (2022-07-05T15:39:29Z) - Off-Policy Actor-Critic with Emphatic Weightings [27.31795386676574]
The off-policy setting has been less clear due to the existence of multiple objectives and the lack of an explicit off-policy policy gradient theorem.
In this work, we unify these objectives into one off-policy objective, and provide a policy gradient theorem for this unified objective.
We show multiple strategies to approximate the gradients in an algorithm called Actor-Critic with Emphatic weightings (ACE).
arXiv Detail & Related papers (2021-11-16T01:18:16Z) - Training Generative Adversarial Networks with Adaptive Composite Gradient [2.471982349512685]
This paper proposes the adaptive Composite Gradients (ACG) method, linearly convergent in bilinear games.
ACG is a semi-gradient-free algorithm since it does not need to calculate the gradient at every step.
Results show ACG is competitive with the previous algorithms.
arXiv Detail & Related papers (2021-11-10T03:13:53Z) - Cautious Actor-Critic [11.82492300303637]
We propose cautious actor-critic (CAC), a novel off-policy AC algorithm.
We show that CAC achieves comparable performance while significantly stabilizing learning.
arXiv Detail & Related papers (2021-07-12T06:40:02Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
We compute unbiased navigation gradients of the value function, which we use as ascent directions to update the policy.
arXiv Detail & Related papers (2020-10-16T15:15:42Z) - Deep Bayesian Quadrature Policy Optimization [100.81242753620597]
Deep Bayesian quadrature policy gradient (DBQPG) is a high-dimensional generalization of Bayesian quadrature for policy gradient estimation.
We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks.
arXiv Detail & Related papers (2020-06-28T15:44:47Z) - How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization [10.424426548124696]
We propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients.
MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning.
We demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.
arXiv Detail & Related papers (2020-04-29T16:30:53Z)
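To make the MAGE entry above more concrete, here is a small, self-contained PyTorch sketch of differentiating a model-based TD error with respect to the action, so that the resulting action-gradient can serve as a learning signal; the network shapes, dimensions, and one-step target below are illustrative assumptions, not MAGE's actual architecture or objective.

    import torch
    import torch.nn as nn

    # Illustrative dimensions and networks (hypothetical, for the sketch only).
    state_dim, action_dim, hidden = 4, 2, 64
    dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
                             nn.Linear(hidden, state_dim + 1))  # predicts next state and reward
    critic = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
                           nn.Linear(hidden, 1))                # Q(s, a)
    actor = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                          nn.Linear(hidden, action_dim), nn.Tanh())

    gamma = 0.99
    s = torch.randn(32, state_dim)          # a batch of states
    a = actor(s)                            # keep the graph so gradients can flow into the action

    # One imagined step through the learned dynamics gives a differentiable TD target.
    pred = dynamics(torch.cat([s, a], dim=-1))
    s_next, r = pred[:, :state_dim], pred[:, state_dim:]
    q = critic(torch.cat([s, a], dim=-1))
    q_next = critic(torch.cat([s_next, actor(s_next)], dim=-1))
    td_error = r + gamma * q_next - q       # differentiable w.r.t. the action a

    # Backpropagating the squared TD error through the learned model yields an
    # action-gradient that could be used to shape the critic or the actor update.
    action_grad = torch.autograd.grad(td_error.pow(2).mean(), a)[0]
    print(action_grad.shape)                # torch.Size([32, 2])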