Time-Efficient Reinforcement Learning with Stochastic Stateful Policies
- URL: http://arxiv.org/abs/2311.04082v1
- Date: Tue, 7 Nov 2023 15:48:07 GMT
- Title: Time-Efficient Reinforcement Learning with Stochastic Stateful Policies
- Authors: Firas Al-Hafez and Guoping Zhao and Jan Peters and Davide Tateo
- Abstract summary: We present a novel approach for training stateful policies by decomposing the latter into a stochastic internal state kernel and a stateless policy.
We introduce different versions of the stateful policy gradient theorem, enabling us to easily instantiate stateful variants of popular reinforcement learning algorithms.
- Score: 20.545058017790428
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Stateful policies play an important role in reinforcement learning, such as
handling partially observable environments, enhancing robustness, or imposing
an inductive bias directly into the policy structure. The conventional method
for training stateful policies is Backpropagation Through Time (BPTT), which
comes with significant drawbacks, such as slow training due to sequential
gradient propagation and the occurrence of vanishing or exploding gradients.
The gradient is often truncated to address these issues, resulting in a biased
policy update. We present a novel approach for training stateful policies by
decomposing the latter into a stochastic internal state kernel and a stateless
policy, jointly optimized by following the stateful policy gradient. We
introduce different versions of the stateful policy gradient theorem, enabling
us to easily instantiate stateful variants of popular reinforcement learning
and imitation learning algorithms. Furthermore, we provide a theoretical
analysis of our new gradient estimator and compare it with BPTT. We evaluate
our approach on complex continuous control tasks, e.g., humanoid locomotion,
and demonstrate that our gradient estimator scales effectively with task
complexity while offering a faster and simpler alternative to BPTT.
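As a rough illustration of the decomposition described in the abstract, the sketch below (a hypothetical PyTorch implementation, not the authors' code) pairs a stochastic internal-state kernel with a stateless Gaussian action policy and accumulates per-step log-probabilities, so a score-function policy gradient can be formed without backpropagating through the recurrence. Module names, dimensions, and the exact conditioning of the action policy on the sampled state are illustrative assumptions.

```python
# Minimal sketch, assuming a stateful policy split into a stochastic
# internal-state kernel z' ~ nu(. | z, o) and a stateless policy a ~ pi(. | z', o).
# All names and dimensions are illustrative, not the paper's implementation.
import torch
import torch.nn as nn


class GaussianHead(nn.Module):
    """Small MLP producing a diagonal Gaussian over its output."""

    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 2 * out_dim))

    def forward(self, x):
        mean, log_std = self.net(x).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())


class StatefulPolicy(nn.Module):
    """Stochastic internal-state kernel + stateless action policy."""

    def __init__(self, obs_dim, act_dim, state_dim=8):
        super().__init__()
        self.state_kernel = GaussianHead(obs_dim + state_dim, state_dim)
        self.action_policy = GaussianHead(obs_dim + state_dim, act_dim)
        self.state_dim = state_dim

    def step(self, obs, z):
        # Sample the next internal state from the stochastic kernel ...
        state_dist = self.state_kernel(torch.cat([obs, z], dim=-1))
        z_next = state_dist.sample()
        # ... then sample the action from the stateless policy given that state.
        act_dist = self.action_policy(torch.cat([obs, z_next], dim=-1))
        action = act_dist.sample()
        # Per-step log-prob of the joint (state, action) sample; summing these
        # over a trajectory yields a score-function gradient with no BPTT.
        log_prob = (state_dist.log_prob(z_next).sum(-1)
                    + act_dist.log_prob(action).sum(-1))
        return action, z_next, log_prob


if __name__ == "__main__":
    policy = StatefulPolicy(obs_dim=4, act_dim=2)
    z = torch.zeros(1, policy.state_dim)
    obs = torch.randn(1, 4)
    action, z, logp = policy.step(obs, z)
    print(action.shape, z.shape, logp.shape)
```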
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline
Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z) - Optimization Landscape of Policy Gradient Methods for Discrete-time
Static Output Feedback [22.21598324895312]
This paper analyzes the optimization landscape inherent to policy gradient methods when applied to static output feedback control.
We derive novel findings regarding convergence (at a nearly dimension-free rate) to stationary points for three policy gradient methods.
We provide proof that the vanilla policy gradient method exhibits linear convergence towards local minima when near such minima.
arXiv Detail & Related papers (2023-10-29T14:25:57Z) - Policy Gradient for Rectangular Robust Markov Decision Processes [62.397882389472564]
We introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs).
Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent.
arXiv Detail & Related papers (2023-01-31T12:40:50Z) - Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step
Q-learning: A Novel Correction Approach [0.0]
We introduce a novel policy similarity measure to mitigate the effects of the off-policy discrepancy in continuous control.
Our method offers an adequate single-step off-policy correction that is applicable to deterministic policy networks.
arXiv Detail & Related papers (2022-08-01T11:33:12Z) - Beyond the Policy Gradient Theorem for Efficient Policy Updates in
Actor-Critic Algorithms [10.356356383401566]
In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states.
We discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target.
We introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions.
arXiv Detail & Related papers (2022-02-15T15:04:10Z) - Semi-On-Policy Training for Sample Efficient Multi-Agent Policy
Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods.
We show that our methods perform as well as or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z) - Faster Policy Learning with Continuous-Time Gradients [6.457260875902829]
We study the estimation of policy gradients for continuous-time systems with known dynamics.
By reframing policy learning in continuous time, we show that it is possible to construct a more efficient and accurate gradient estimator.
arXiv Detail & Related papers (2020-12-12T00:22:56Z) - Ensuring Monotonic Policy Improvement in Entropy-regularized Value-based
Reinforcement Learning [14.325835899564664]
An entropy-regularized value-based reinforcement learning method can ensure the monotonic improvement of policies at each policy update.
We propose a novel reinforcement learning algorithm that exploits this lower-bound as a criterion for adjusting the degree of a policy update for alleviating policy oscillation.
arXiv Detail & Related papers (2020-08-25T04:09:18Z) - DDPG++: Striving for Simplicity in Continuous-control Off-Policy
Reinforcement Learning [95.60782037764928]
First, we show that the simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas from the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy to prevent deterioration of performance.
arXiv Detail & Related papers (2020-06-26T20:21:12Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.