Off-Policy Actor-Critic with Emphatic Weightings
- URL: http://arxiv.org/abs/2111.08172v3
- Date: Thu, 13 Apr 2023 20:18:25 GMT
- Title: Off-Policy Actor-Critic with Emphatic Weightings
- Authors: Eric Graves, Ehsan Imani, Raksha Kumaraswamy, Martha White
- Abstract summary: The off-policy setting has been less clear due to the existence of multiple objectives and the lack of an explicit off-policy policy gradient theorem.
In this work, we unify these objectives into one off-policy objective, and provide a policy gradient theorem for this unified objective.
We show multiple strategies to approximate the gradients, in an algorithm called Actor Critic with Emphatic weightings (ACE).
- Score: 27.31795386676574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A variety of theoretically-sound policy gradient algorithms exist for the
on-policy setting due to the policy gradient theorem, which provides a
simplified form for the gradient. The off-policy setting, however, has been
less clear due to the existence of multiple objectives and the lack of an
explicit off-policy policy gradient theorem. In this work, we unify these
objectives into one off-policy objective, and provide a policy gradient theorem
for this unified objective. The derivation involves emphatic weightings and
interest functions. We show multiple strategies to approximate the gradients,
in an algorithm called Actor Critic with Emphatic weightings (ACE). We prove in
a counterexample that previous (semi-gradient) off-policy actor-critic
methods--particularly Off-Policy Actor-Critic (OffPAC) and Deterministic Policy
Gradient (DPG)--converge to the wrong solution whereas ACE finds the optimal
solution. We also highlight why these semi-gradient approaches can still
perform well in practice, suggesting strategies for variance reduction in ACE.
We empirically study several variants of ACE on two classic control
environments and an image-based environment designed to illustrate the
tradeoffs made by each gradient approximation. We find that by approximating
the emphatic weightings directly, ACE performs as well as or better than OffPAC
in all settings tested.
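To make the role of the emphatic weighting concrete, below is a minimal sketch of an ACE-style actor update in which the emphatic weighting is estimated incrementally with a followon trace. The softmax-over-linear-features policy, the linear critic used for the TD error, and all variable names are assumptions for illustration rather than details taken from the paper; the trace parameter lambda_a interpolates between a semi-gradient (OffPAC-like) update at lambda_a = 0 and a fully emphatically-weighted update at lambda_a = 1.

import numpy as np

def softmax_policy(theta, phi_s):
    # Action preferences are linear in the state features; returns action probabilities.
    prefs = phi_s @ theta              # shape: (num_actions,)
    prefs = prefs - prefs.max()        # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def ace_actor_step(theta, w, F, phi_s, a, r, phi_next, rho_prev, rho,
                   interest, gamma=0.99, lambda_a=1.0, alpha_actor=0.01):
    # One ACE-style actor update (illustrative sketch, not the paper's exact algorithm).
    # F is the followon trace; M is the estimated emphatic weighting.
    F = rho_prev * gamma * F + interest
    M = (1.0 - lambda_a) * interest + lambda_a * F

    # One-step TD error from an assumed linear critic with weights w.
    delta = r + gamma * (phi_next @ w) - (phi_s @ w)

    # Gradient of log pi(a|s) for the softmax policy above.
    probs = softmax_policy(theta, phi_s)
    grad_log_pi = np.outer(phi_s, -probs)
    grad_log_pi[:, a] += phi_s

    # Emphatically weighted off-policy actor update.
    theta = theta + alpha_actor * rho * M * delta * grad_log_pi
    return theta, F

In this sketch, setting lambda_a = 0 recovers a semi-gradient update of the kind the paper's counterexample shows can converge to a suboptimal policy, while lambda_a = 1 weights each update by the full followon trace; intermediate values trade off that bias against the variance the followon trace introduces.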
Related papers
- Imitate Optimal Policy: Prevail and Induce Action Collapse in Policy Gradient [61.440209025381016]
Policy reinforcement learning methods frequently utilize deep neural networks (DNNs) to learn a shared backbone of feature representations used to compute likelihoods in an action selection layer. We show that under certain constraints, a structure resembling neural collapse, which we refer to as Action Collapse (AC), emerges. We propose the Action Collapse Policy Gradient (ACPG) method, which accordingly affixes a synthetic ETF as the action selection layer.
arXiv Detail & Related papers (2025-09-02T18:33:11Z) - Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning [66.4260157478436]
We study reinforcement learning in the policy learning setting. The goal is to find a policy whose performance is competitive with the best policy in a given class of interest.
arXiv Detail & Related papers (2025-07-06T14:40:05Z) - Learning Deterministic Policies with Policy Gradients in Constrained Markov Decision Processes [59.27926064817273]
We introduce an exploration-agnostic algorithm, called C-PG, which enjoys global last-iterate convergence guarantees under domination assumptions. We empirically validate both the action-based (C-PGAE) and parameter-based (C-PGPE) variants of C-PG on constrained control tasks.
arXiv Detail & Related papers (2025-06-06T10:29:05Z) - Analysis of On-policy Policy Gradient Methods under the Distribution Mismatch [5.684512532326261]
We analyze the impact of a distribution mismatch on policy gradient methods.
Our findings offer new insights into the robustness of policy gradient methods as well as the gap between theoretical foundations and practical implementations.
arXiv Detail & Related papers (2025-03-28T08:52:41Z) - Residual Policy Gradient: A Reward View of KL-regularized Objective [48.39829592175419]
Reinforcement Learning and Imitation Learning have achieved widespread success in many domains but remain constrained during real-world deployment.
Policy customization has been introduced, aiming to adapt a prior policy while preserving its inherent properties and meeting new task-specific requirements.
A principled approach to policy customization is Residual Q-Learning (RQL), which formulates the problem as a Markov Decision Process (MDP) and derives a family of value-based learning algorithms.
We introduce Residual Policy Gradient (RPG), which extends RQL to policy gradient methods, allowing policy customization in gradient-based RL settings.
arXiv Detail & Related papers (2025-03-14T02:30:13Z) - Convergence of Policy Mirror Descent Beyond Compatible Function Approximation [66.4260157478436]
We develop theoretical guarantees for PMD with general policy classes, where we assume only a strictly weaker variational dominance condition and obtain convergence to the best-in-class policy.
Our analysis leverages a novel notion based on the local norm of the gradient induced by the occupancy measure.
arXiv Detail & Related papers (2025-02-16T08:05:46Z) - Beyond Expected Returns: A Policy Gradient Algorithm for Cumulative Prospect Theoretic Reinforcement Learning [0.46040036610482665]
Cumulative Prospect Theory (CPT) has been developed to provide a better model for human-based decision-making supported by empirical evidence.
A few years ago, CPT was combined with Reinforcement Learning (RL) to formulate a CPT policy optimization problem.
We show that our policy gradient algorithm scales better to larger state spaces compared to the existing zeroth order algorithm for solving the same problem.
arXiv Detail & Related papers (2024-10-03T15:45:39Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Mollification Effects of Policy Gradient Methods [16.617678267301702]
We develop a rigorous framework for understanding how policy gradient methods mollify non-smooth optimization landscapes.
We demonstrate the equivalence between policy gradient methods and solving backward heat equations.
We make the connection between this limitation and the uncertainty principle in harmonic analysis to understand the effects of exploration with policies in RL.
arXiv Detail & Related papers (2024-05-28T05:05:33Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Gradient Informed Proximal Policy Optimization [35.22712034665224]
We introduce a novel policy learning method that integrates analytical gradients from differentiable environments with the Proximal Policy Optimization (PPO) algorithm.
By adaptively modifying the alpha value, we can effectively manage the influence of analytical policy gradients during learning.
Our proposed approach outperforms baseline algorithms in various scenarios, such as function optimization, physics simulations, and traffic control environments.
arXiv Detail & Related papers (2023-12-14T07:50:21Z) - Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning [17.916366827429034]
We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions.
We propose an Anchor-changing Regularized Natural Policy Gradient framework, which can incorporate ideas from well-performing first-order methods.
arXiv Detail & Related papers (2022-06-10T21:09:44Z) - Sigmoidally Preconditioned Off-policy Learning: a new exploration method for reinforcement learning [14.991913317341417]
We focus on an off-policy Actor-Critic architecture and propose a novel method called Preconditioned Proximal Policy Optimization (P3O).
P3O can control the high variance of importance sampling by applying a preconditioner to the Conservative Policy Iteration (CPI) objective.
Results show that our P3O maximizes the CPI objective better than PPO during the training process.
arXiv Detail & Related papers (2022-05-20T09:38:04Z) - Characterizing the Gap Between Actor-Critic and Policy Gradient [47.77939973964009]
We explain the gap between AC and PG methods by identifying the exact adjustment to the AC objective/gradient.
We develop practical algorithms, Residual Actor-Critic and Stackelberg Actor-Critic, for estimating the correction between AC and PG.
arXiv Detail & Related papers (2021-06-13T06:35:42Z) - Semi-On-Policy Training for Sample Efficient Multi-Agent Policy Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods.
We show that our methods perform as well or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z) - On Proximal Policy Optimization's Heavy-tailed Gradients [150.08522793940708]
We study the heavy-tailed nature of the gradients of the Proximal Policy Optimization surrogate reward function.
In this paper, we study the effects of the standard PPO clipping tricks, demonstrating that they primarily serve to offset heavy-tailedness in gradients.
We propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks.
arXiv Detail & Related papers (2021-02-20T05:51:28Z) - On Linear Convergence of Policy Gradient Methods for Finite MDPs [8.00114449574708]
We revisit the finite-time analysis of policy gradient methods in one of the simplest settings.
We show that many variants of policy gradient methods succeed with large step-sizes and attain a linear rate of convergence.
arXiv Detail & Related papers (2020-07-21T22:35:37Z) - Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
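The last entry above describes combining critic-estimated action values over a discrete action space to reduce the variance of the gradient estimate. The sketch below is a generic "all-action" policy-gradient estimator in that spirit, not the paper's specific algorithm; the array shapes and names are assumptions for illustration.

import numpy as np

def all_action_policy_gradient(probs, q_values, grad_log_pi):
    # Variance-reduced policy gradient for a discrete action space (illustrative sketch).
    # Rather than using only the sampled action's return, the gradient combines
    # critic estimates Q(s, a) for every action a, weighted by the policy:
    #     g = sum_a pi(a|s) * grad_log_pi(a|s) * Q(s, a)
    #
    # probs:        shape (A,)    action probabilities under the current policy
    # q_values:     shape (A,)    critic estimates of Q(s, a)
    # grad_log_pi:  shape (A, P)  gradient of log pi(a|s) w.r.t. the P policy parameters
    return (probs * q_values) @ grad_log_pi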
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.