Related papers: Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms

URL: http://arxiv.org/abs/2202.07496v1
Date: Tue, 15 Feb 2022 15:04:10 GMT
Title: Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms
Authors: Romain Laroche, Remi Tachet
Abstract summary: In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. We discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. We introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions.
Score: 10.356356383401566
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing $q$, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.
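The contrast between the two update directions named in the abstract can be illustrated with a minimal sketch. It assumes a single-state (bandit) problem, a softmax policy over logits, and exact q-values; none of these choices, nor the function names, the learning rate, or the starting logits, come from the paper. The sketch covers only the standard policy gradient direction and the plain cross-entropy direction toward the action maximizing $q$. It does not implement the authors' modified update, which the abstract does not specify.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax over the logits.
    z = theta - np.max(theta)
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_direction(theta, q):
    # Gradient of v = sum_a pi(a) q(a) w.r.t. softmax logits: pi(a) * (q(a) - v).
    pi = softmax(theta)
    v = pi @ q
    return pi * (q - v)

def cross_entropy_direction(theta, q):
    # Negative gradient of -log pi(a*) w.r.t. logits, with a* = argmax q:
    # one_hot(a*) - pi.
    pi = softmax(theta)
    target = np.zeros_like(pi)
    target[np.argmax(q)] = 1.0
    return target - pi

def steps_to_unlearn(direction, theta0, q, lr=1.0, max_steps=100_000):
    # Count updates until the policy's most likely action matches argmax q.
    theta = theta0.copy()
    best = int(np.argmax(q))
    for step in range(max_steps):
        if int(np.argmax(theta)) == best:
            return step
        theta = theta + lr * direction(theta, q)
    return max_steps

# Hypothetical setup: the policy strongly prefers action 0,
# but the value target now says action 1 is better.
theta0 = np.array([8.0, 0.0])
q = np.array([0.0, 1.0])

pg_steps = steps_to_unlearn(policy_gradient_direction, theta0, q)
ce_steps = steps_to_unlearn(cross_entropy_direction, theta0, q)
print("policy gradient steps:", pg_steps)
print("cross-entropy steps:", ce_steps)
```

In this toy setting the policy gradient direction is scaled by the probability of each action, so it is very small while the better action is still nearly never chosen, and the cross-entropy direction flips the preferred action in far fewer steps. The sketch says nothing about the value decrease that the abstract attributes to plain cross-entropy updates in the general multi-state case.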

Related papers

Achieve Performatively Optimal Policy for Performative Reinforcement Learning [55.983627302691424]
This work proposes a zeroth-order Frank-Wolfe (0FW) algorithm with a gradient of performative policy in the framework. Experimental results demonstrate that our 0FW is more effective than the existing approximation in finding the desired PO policy.
arXiv Detail & Related papers (2025-10-06T01:56:31Z)
Relative Entropy Pathwise Policy Optimization [56.86405621176669]
We show how to construct a value-gradient driven, on-policy algorithm that allows training Q-value models purely from on-policy data. We propose Relative Entropy Pathwise Policy Optimization (REPPO), an efficient on-policy algorithm that combines the sample-efficiency of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning.
arXiv Detail & Related papers (2025-07-15T06:24:07Z)
Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems. In common practice, convergence (hyper)policies are learned only to deploy their deterministic version. We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates. We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change. We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
arXiv Detail & Related papers (2023-06-18T15:50:57Z)
Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization [14.028916306297928]
Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy. We propose a novel value enhancement method to improve the performance of a given initial policy computed by existing state-of-the-art RL algorithms.
arXiv Detail & Related papers (2023-01-05T18:43:40Z)
Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded. We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
A Parametric Class of Approximate Gradient Updates for Policy Optimization [47.69337420768319]
We develop a unified perspective that re-expresses the underlying updates in terms of a limited choice of gradient form and scaling function. We obtain novel yet well motivated updates that generalize existing algorithms in a way that can deliver benefits both in terms of convergence speed and final result quality.
arXiv Detail & Related papers (2022-06-17T01:28:38Z)
Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor Critic under State Distribution Mismatch [29.02336004872336]
We establish the global optimality and convergence rate of an off-policy actor critic algorithm. Our work goes beyond existing works on the optimality of policy gradient methods.
arXiv Detail & Related papers (2021-11-04T16:48:45Z)
Cautious Policy Programming: Exploiting KL Regularization in Monotonic Policy Improvement for Reinforcement Learning [11.82492300303637]
We propose a novel value-based reinforcement learning (RL) algorithm that can ensure monotonic policy improvement during learning. We demonstrate that the proposed algorithm can trade off performance and stability in both didactic classic control problems and challenging high-dimensional Atari games.
arXiv Detail & Related papers (2021-07-13T01:03:10Z)
Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria. We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
Ensuring Monotonic Policy Improvement in Entropy-regularized Value-based Reinforcement Learning [14.325835899564664]
An entropy-regularized value-based reinforcement learning method can ensure the monotonic improvement of policies at each policy update. We propose a novel reinforcement learning algorithm that exploits this lower-bound as a criterion for adjusting the degree of a policy update for alleviating policy oscillation.
arXiv Detail & Related papers (2020-08-25T04:09:18Z)
Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous. In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist. We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.