Beyond the Policy Gradient Theorem for Efficient Policy Updates in
Actor-Critic Algorithms
- URL: http://arxiv.org/abs/2202.07496v1
- Date: Tue, 15 Feb 2022 15:04:10 GMT
- Title: Beyond the Policy Gradient Theorem for Efficient Policy Updates in
Actor-Critic Algorithms
- Authors: Romain Laroche, Remi Tachet
- Abstract summary: In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states.
We discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target.
We introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions.
- Score: 10.356356383401566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Reinforcement Learning, the optimal action at a given state is dependent
on policy decisions at subsequent states. As a consequence, the learning
targets evolve with time and the policy optimization process must be efficient
at unlearning what it previously learnt. In this paper, we discover that the
policy gradient theorem prescribes policy updates that are slow to unlearn
because of their structural symmetry with respect to the value target. To
increase the unlearning speed, we study a novel policy update: the gradient of
the cross-entropy loss with respect to the action maximizing $q$, but find that
such updates may lead to a decrease in value. Consequently, we introduce a
modified policy update devoid of that flaw, and prove its guarantees of
convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic
assumptions. Further, we assess standard policy updates and our cross-entropy
policy updates along six analytical dimensions. Finally, we empirically
validate our theoretical findings.
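To make the unlearning argument concrete, here is a minimal sketch that contrasts the two update directions named in the abstract on a toy problem. It is not the paper's algorithm or its modified update. It assumes a single state, a softmax policy over two actions, exact $q$ values, and a fixed step size. The function names, the starting logits, and the 0.5 probability threshold are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pg_direction(logits, q):
    """Single-state policy gradient for a softmax policy: pi(a) * (q(a) - v)."""
    pi = softmax(logits)
    v = sum(p * qa for p, qa in zip(pi, q))
    return [p * (qa - v) for p, qa in zip(pi, q)]


def ce_direction(logits, q):
    """Ascent direction on log pi(a*), a* = argmax q: 1[a = a*] - pi(a)."""
    pi = softmax(logits)
    best = max(range(len(q)), key=lambda a: q[a])
    return [(1.0 if a == best else 0.0) - p for a, p in enumerate(pi)]


def steps_to_unlearn(direction, logits, q, lr=1.0, max_steps=10_000):
    """Count updates until the q-maximizing action has probability above 0.5."""
    logits = list(logits)
    best = max(range(len(q)), key=lambda a: q[a])
    for step in range(max_steps):
        if softmax(logits)[best] > 0.5:
            return step
        d = direction(logits, q)
        logits = [x + lr * g for x, g in zip(logits, d)]
    return max_steps


# Hypothetical setup: the policy strongly prefers action 0,
# but the value target now favors action 1.
q = [0.0, 1.0]
stale_logits = [5.0, 0.0]

pg_steps = steps_to_unlearn(pg_direction, stale_logits, q)
ce_steps = steps_to_unlearn(ce_direction, stale_logits, q)
print("policy gradient steps:", pg_steps)
print("cross-entropy steps:", ce_steps)
```

In this toy setting the policy-gradient direction is scaled by the current probability of the better action, so it is small exactly when the policy is most wrong. The cross-entropy direction is not damped in that way. The sketch does not show the value decrease that the abstract attributes to plain cross-entropy updates, nor the modified update that avoids it.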
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, convergent (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
arXiv Detail & Related papers (2023-06-18T15:50:57Z)
- Value Enhancement of Reinforcement Learning via Efficient and Robust
Trust Region Optimization [14.028916306297928]
Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy.
We propose a novel value enhancement method to improve the performance of a given initial policy computed by existing state-of-the-art RL algorithms.
arXiv Detail & Related papers (2023-01-05T18:43:40Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- A Parametric Class of Approximate Gradient Updates for Policy
Optimization [47.69337420768319]
We develop a unified perspective that re-expresses the underlying updates in terms of a limited choice of gradient form and scaling function.
We obtain novel yet well motivated updates that generalize existing algorithms in a way that can deliver benefits both in terms of convergence speed and final result quality.
arXiv Detail & Related papers (2022-06-17T01:28:38Z)
- Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor
Critic under State Distribution Mismatch [29.02336004872336]
We establish the global optimality and convergence rate of an off-policy actor critic algorithm.
Our work goes beyond existing works on the optimality of policy gradient methods.
arXiv Detail & Related papers (2021-11-04T16:48:45Z)
- Cautious Policy Programming: Exploiting KL Regularization in Monotonic
Policy Improvement for Reinforcement Learning [11.82492300303637]
We propose a novel value-based reinforcement learning (RL) algorithm that can ensure monotonic policy improvement during learning.
We demonstrate that the proposed algorithm can trade off performance and stability in both didactic classic control problems and challenging high-dimensional Atari games.
arXiv Detail & Related papers (2021-07-13T01:03:10Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Ensuring Monotonic Policy Improvement in Entropy-regularized Value-based
Reinforcement Learning [14.325835899564664]
An entropy-regularized value-based reinforcement learning method can ensure the monotonic improvement of policies at each policy update.
We propose a novel reinforcement learning algorithm that exploits this lower-bound as a criterion for adjusting the degree of a policy update for alleviating policy oscillation.
arXiv Detail & Related papers (2020-08-25T04:09:18Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic
Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)