Adaptive Primal-Dual Method for Safe Reinforcement Learning
- URL: http://arxiv.org/abs/2402.00355v1
- Date: Thu, 1 Feb 2024 05:53:44 GMT
- Title: Adaptive Primal-Dual Method for Safe Reinforcement Learning
- Authors: Weiqin Chen, James Onyejizu, Long Vu, Lan Hoang, Dharmashankar
Subramanian, Koushik Kar, Sandipan Mishra and Santiago Paternain
- Abstract summary: We propose, analyze and evaluate adaptive primal-dual (APD) methods for Safe Reinforcement Learning (SRL).
Two adaptive LRs are adjusted to the Lagrangian multipliers so as to optimize the policy in each iteration.
Experiments show that the practical APD algorithm outperforms the constant-LR baselines (or achieves comparable performance) and attains more stable training.
- Score: 9.5147410074115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Primal-dual methods have a natural application in Safe Reinforcement Learning
(SRL), posed as a constrained policy optimization problem. In practice however,
applying primal-dual methods to SRL is challenging, due to the inter-dependency
of the learning rate (LR) and Lagrangian multipliers (dual variables) each time
an embedded unconstrained RL problem is solved. In this paper, we propose,
analyze and evaluate adaptive primal-dual (APD) methods for SRL, where two
adaptive LRs are adjusted to the Lagrangian multipliers so as to optimize the
policy in each iteration. We theoretically establish the convergence,
optimality and feasibility of the APD algorithm. Finally, we conduct numerical
evaluations of the practical APD algorithm on four well-known environments in
Bullet-Safety-Gym, employing two state-of-the-art SRL algorithms: PPO-Lagrangian
and DDPG-Lagrangian. All experiments show that the practical APD algorithm
outperforms the constant-LR baselines (or achieves comparable performance) and
attains more stable training. Additionally, we substantiate the robustness of
selecting the two adaptive LRs with empirical evidence.
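The abstract describes adjusting the primal learning rate to the current Lagrangian multipliers. The paper's actual schedule and convergence analysis are not reproduced here; the following is a minimal toy sketch of the generic primal-dual loop such methods build on, with a hypothetical adaptive learning rate that shrinks as the multiplier grows. All functions, constants, and the LR rule below are illustrative assumptions, not the paper's method:

```python
# Toy constrained problem: maximize r(theta) = -(theta - 2)^2
# subject to c(theta) = theta <= 1.  Constrained optimum: theta = 1.
def reward_grad(theta):
    # d/dtheta of -(theta - 2)^2
    return -2.0 * (theta - 2.0)

def cost(theta):
    return theta

LIMIT = 1.0

theta, lam = 0.0, 0.0          # primal variable and Lagrange multiplier
eta0, eta_lam = 0.1, 0.05      # base primal LR and dual LR
for _ in range(2000):
    # Hypothetical adaptive primal LR: shrink as the multiplier grows,
    # so large dual variables do not destabilize the primal step.
    eta_theta = eta0 / (1.0 + lam)
    # Primal ascent on the Lagrangian L = r(theta) - lam * (c(theta) - LIMIT);
    # dc/dtheta = 1 for this toy cost.
    theta += eta_theta * (reward_grad(theta) - lam)
    # Dual ascent on the constraint violation, projected onto lam >= 0.
    lam = max(0.0, lam + eta_lam * (cost(theta) - LIMIT))
```

At the KKT point of this toy problem, theta = 1 and lam = 2 (from reward_grad(1) - lam = 0), which the loop approaches under these step sizes.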
Related papers
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based settings.
arXiv Detail & Related papers (2024-05-29T22:12:52Z) - Two-Stage ML-Guided Decision Rules for Sequential Decision Making under Uncertainty [55.06411438416805]
Sequential Decision Making under Uncertainty (SDMU) is ubiquitous in many domains such as energy, finance, and supply chains.
Some SDMU applications are naturally modeled as Multistage Problems (MSPs), but the resulting optimizations are notoriously challenging from a computational standpoint.
This paper introduces a novel approach, Two-Stage General Decision Rules (TS-GDR), to generalize the policy space beyond linear functions.
The effectiveness of TS-GDR is demonstrated through an instantiation using Deep Recurrent Neural Networks, named Two-Stage Deep Decision Rules (TS-LDR).
arXiv Detail & Related papers (2024-05-23T18:19:47Z) - DPO: Differential reinforcement learning with application to optimal configuration search [3.2857981869020327]
Reinforcement learning with continuous state and action spaces remains one of the most challenging problems within the field.
We propose the first differential RL framework that can handle settings with limited training samples and short-length episodes.
arXiv Detail & Related papers (2024-04-24T03:11:12Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm.
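The double-sampling issue arises because the squared-expectation term in the variance cannot be given an unbiased single-sample gradient estimate. Fenchel duality sidesteps this via a textbook conjugate identity for the square function; the notation below is ours, not necessarily the paper's:

```latex
\operatorname{Var}(X) \;=\; \mathbb{E}[X^2] - \bigl(\mathbb{E}[X]\bigr)^2
\;=\; \min_{\nu \in \mathbb{R}} \, \mathbb{E}\bigl[(X - \nu)^2\bigr],
```

since \(\mathbb{E}[(X-\nu)^2] = \operatorname{Var}(X) + (\mathbb{E}[X] - \nu)^2\), minimized at \(\nu^\* = \mathbb{E}[X]\). The right-hand side is a single expectation of a per-sample quantity, so one sample yields an unbiased stochastic gradient in both \(\nu\) and the policy parameters.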
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - A Policy Efficient Reduction Approach to Convex Constrained Deep
Reinforcement Learning [2.811714058940267]
We propose a new variant of the conditional gradient (CG) type algorithm, which generalizes the minimum norm point (MNP) method.
Our method reduces the memory costs by an order of magnitude, and achieves better performance, demonstrating both its effectiveness and efficiency.
arXiv Detail & Related papers (2021-08-29T20:51:32Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based ZO algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization, instead of using random sampling.
Our results show that the ZO-RL algorithm can effectively reduce the variance of the ZO gradient estimates by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
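For context, the random-sampling baseline that ZO-RL improves upon is the standard two-point zeroth-order gradient estimator with Gaussian perturbations. A minimal sketch (a generic textbook estimator, not ZO-RL itself):

```python
import numpy as np

def zo_gradient(f, x, num_samples=5000, mu=1e-3, seed=0):
    """Two-point zeroth-order (ZO) gradient estimate of f at x using
    random Gaussian perturbations -- the random-sampling baseline that
    a learned sampling policy would replace."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(x, dtype=float)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape)       # random direction
        delta = f(x + mu * u) - f(x - mu * u)  # symmetric finite difference
        grad += (delta / (2.0 * mu)) * u
    return grad / num_samples
```

For f(x) = ||x||^2 the estimate approaches the true gradient 2x as the sample count grows; the per-sample variance of this estimator is exactly what a learned sampling policy aims to reduce.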
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward while avoiding violations of certain constraints.
This is the first analysis of SRL algorithms with convergence to globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z) - Mixed Reinforcement Learning with Additive Stochastic Uncertainty [19.229447330293546]
Reinforcement learning (RL) methods often rely on massive exploration data to search optimal policies, and suffer from poor sampling efficiency.
This paper presents a mixed RL algorithm by simultaneously using dual representations of environmental dynamics to search the optimal policy.
The effectiveness of the mixed RL is demonstrated by a typical optimal control problem of non-affine nonlinear systems.
arXiv Detail & Related papers (2020-02-28T08:02:34Z) - Guided Constrained Policy Optimization for Dynamic Quadrupedal Robot
Locomotion [78.46388769788405]
We introduce guided constrained policy optimization (GCPO), an RL framework based upon our implementation of constrained policy optimization (CPPO).
We show that guided constrained RL offers faster convergence close to the desired optimum resulting in an optimal, yet physically feasible, robotic control behavior without the need for precise reward function tuning.
arXiv Detail & Related papers (2020-02-22T10:15:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.