Meta SAC-Lag: Towards Deployable Safe Reinforcement Learning via MetaGradient-based Hyperparameter Tuning
- URL: http://arxiv.org/abs/2408.07962v1
- Date: Thu, 15 Aug 2024 06:18:50 GMT
- Title: Meta SAC-Lag: Towards Deployable Safe Reinforcement Learning via MetaGradient-based Hyperparameter Tuning
- Authors: Homayoun Honari, Amir Mehdi Soufi Enayati, Mehran Ghafarian Tamizi, Homayoun Najjaran
- Abstract summary: Safe Reinforcement Learning (Safe RL) is one of the prevalently studied subcategories of trial-and-error-based methods.
We propose a unified Lagrangian-based model-free architecture called Meta Soft Actor-Critic Lagrangian (Meta SAC-Lag).
Our results show that the agent can reliably adjust the safety performance due to the relatively fast convergence rate of the safety threshold.
- Score: 2.7898966850590625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Safe Reinforcement Learning (Safe RL) is one of the prevalently studied subcategories of trial-and-error-based methods with the intention to be deployed on real-world systems. In safe RL, the goal is to maximize reward performance while minimizing constraints, often achieved by setting bounds on constraint functions and utilizing the Lagrangian method. However, deploying Lagrangian-based safe RL in real-world scenarios is challenging due to the necessity of threshold fine-tuning, as imprecise adjustments may lead to suboptimal policy convergence. To mitigate this challenge, we propose a unified Lagrangian-based model-free architecture called Meta Soft Actor-Critic Lagrangian (Meta SAC-Lag). Meta SAC-Lag uses meta-gradient optimization to automatically update the safety-related hyperparameters. The proposed method is designed to address safe exploration and threshold adjustment with minimal hyperparameter tuning requirement. In our pipeline, the inner parameters are updated through the conventional formulation and the hyperparameters are adjusted using the meta-objectives which are defined based on the updated parameters. Our results show that the agent can reliably adjust the safety performance due to the relatively fast convergence rate of the safety threshold. We evaluate the performance of Meta SAC-Lag in five simulated environments against Lagrangian baselines, and the results demonstrate its capability to create synergy between parameters, yielding better or competitive results. Furthermore, we conduct a real-world experiment involving a robotic arm tasked with pouring coffee into a cup without spillage. Meta SAC-Lag is successfully trained to execute the task, while minimizing effort constraints.
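The abstract describes a bilevel scheme: the inner (actor and critic) parameters take a conventional Lagrangian-style update, and the safety-related hyperparameters are then adjusted by differentiating a meta-objective evaluated at the updated parameters. The toy sketch below illustrates only this meta-gradient pattern, not the paper's actual SAC-Lag losses; the scalar objectives, the softplus penalty, the exponential parameterization of the multiplier and threshold, and all names and step sizes are illustrative assumptions.

```python
# Toy sketch of meta-gradient hyperparameter tuning: an inner Lagrangian step
# on the parameters, followed by a hyperparameter step against a meta-objective
# evaluated at the updated parameters. Not the paper's actual objectives.
import torch
import torch.nn.functional as F

theta = torch.tensor(0.0, requires_grad=True)    # stand-in for actor/critic weights
log_lam = torch.tensor(0.0, requires_grad=True)  # log Lagrange multiplier (hyperparameter)
log_eps = torch.tensor(0.0, requires_grad=True)  # log safety threshold (hyperparameter)
meta_opt = torch.optim.Adam([log_lam, log_eps], lr=1e-2)
inner_lr = 0.1

def reward_loss(p):   # toy surrogate for the reward objective
    return (p - 2.0) ** 2

def cost(p):          # toy surrogate for the expected constraint cost
    return F.relu(p)

for step in range(200):
    lam, eps = log_lam.exp(), log_eps.exp()

    # Inner update: conventional Lagrangian-style objective at the current
    # hyperparameters; a smooth penalty keeps the step differentiable in eps.
    inner = reward_loss(theta) + lam * F.softplus(cost(theta) - eps)
    (grad_theta,) = torch.autograd.grad(inner, theta, create_graph=True)
    theta_new = theta - inner_lr * grad_theta    # differentiable inner step

    # Meta-objective defined on the *updated* parameters; its gradient flows
    # back through theta_new into the hyperparameters (the meta-gradient).
    meta = reward_loss(theta_new) + cost(theta_new)
    meta_opt.zero_grad()
    meta.backward()
    meta_opt.step()

    with torch.no_grad():                        # commit the inner step
        theta.copy_(theta_new)
        theta.grad = None
```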
Related papers
- Constrained Meta Agnostic Reinforcement Learning [2.3749120526936465]
Constraint Model Agnostic Meta Learning (C-MAML) enables rapid and efficient task adaptation by incorporating task-specific constraints directly into its meta-algorithm framework during the training phase.
We demonstrate the effectiveness of C-MAML in simulated locomotion tasks with wheeled robots of varying complexity, highlighting its practicality and robustness in dynamic environments.
arXiv Detail & Related papers (2024-06-20T07:11:27Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method that leverages the multimodal policy parameterization and a learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- A Multiplicative Value Function for Safe and Efficient Reinforcement Learning [131.96501469927733]
We propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic.
The safety critic predicts the probability of constraint violation and discounts the reward critic, which only estimates constraint-free returns (a minimal sketch of this multiplicative composition appears after this list).
We evaluate our method in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations.
arXiv Detail & Related papers (2023-03-07T18:29:15Z)
- Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing violations in policy tasks in safe reinforcement learning.
arXiv Detail & Related papers (2022-07-21T11:14:47Z)
- Constrained Policy Optimization via Bayesian World Models [79.0077602277004]
LAMBDA is a model-based approach for policy optimization in safety-critical tasks modeled via constrained Markov decision processes.
We demonstrate LAMBDA's state-of-the-art performance on the Safety-Gym benchmark suite in terms of sample efficiency and constraint violation.
arXiv Detail & Related papers (2022-01-24T17:02:22Z)
- Model-based Chance-Constrained Reinforcement Learning via Separated Proportional-Integral Lagrangian [5.686699342802045]
We propose a separated proportional-integral Lagrangian algorithm to enhance RL safety under uncertainty.
We demonstrate that our method can reduce the oscillations and conservatism of the RL policy in a car-following simulation.
arXiv Detail & Related papers (2021-08-26T07:34:14Z)
- Responsive Safety in Reinforcement Learning by PID Lagrangian Methods [74.49173841304474]
Lagrangian methods exhibit oscillations and overshoot which, when applied to safe reinforcement learning, lead to constraint-violating behavior.
We propose a novel Lagrange multiplier update method that utilizes derivatives of the constraint function (a minimal sketch of such a PID-style multiplier update appears after this list).
We apply our PID Lagrangian methods in deep RL, setting a new state of the art in Safety Gym, a safe RL benchmark.
arXiv Detail & Related papers (2020-07-08T08:43:14Z)
- Guided Constrained Policy Optimization for Dynamic Quadrupedal Robot Locomotion [78.46388769788405]
We introduce guided constrained policy optimization (GCPO), an RL framework based upon our implementation of constrained policy optimization (CPPO).
We show that guided constrained RL offers faster convergence close to the desired optimum, resulting in optimal yet physically feasible robotic control behavior without the need for precise reward function tuning.
arXiv Detail & Related papers (2020-02-22T10:15:53Z)
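The multiplicative value function entry above composes a safety critic with a reward critic. Under that stated description, a minimal PyTorch sketch could look as follows; the module name, network sizes, and the sigmoid head are assumptions, not the cited paper's architecture.

```python
# Sketch of a multiplicative value function: a safety critic estimating the
# probability of constraint violation discounts a reward critic that estimates
# constraint-free returns. Sizes and names are illustrative.
import torch
import torch.nn as nn

class MultiplicativeCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.reward_critic = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.safety_critic = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        q_reward = self.reward_critic(x)                     # constraint-free return estimate
        p_violation = torch.sigmoid(self.safety_critic(x))   # probability of constraint violation
        return (1.0 - p_violation) * q_reward                # reward discounted by safety
```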
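The PID Lagrangian entry above (and, similarly, the separated proportional-integral Lagrangian entry) replaces the usual integral-only multiplier update with proportional, integral, and derivative terms on the constraint violation. The sketch below is a hedged illustration of that idea; the gains, the cost signal, and the clipping choices are assumptions rather than the cited papers' exact update rules.

```python
# Sketch of a PID-style Lagrange multiplier update driven by the measured
# constraint violation. Gains and projections are illustrative.
class PIDLagrangeMultiplier:
    def __init__(self, cost_limit, kp=0.1, ki=0.01, kd=0.05):
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_violation = 0.0

    def update(self, episode_cost):
        """Return a non-negative multiplier given the latest measured episode cost."""
        violation = episode_cost - self.cost_limit              # error signal
        self.integral = max(0.0, self.integral + violation)     # integral term with anti-windup
        derivative = max(0.0, violation - self.prev_violation)  # react only to rising cost
        self.prev_violation = violation
        return max(0.0, self.kp * violation
                        + self.ki * self.integral
                        + self.kd * derivative)

# Example usage with hypothetical numbers:
# lam = PIDLagrangeMultiplier(cost_limit=25.0)
# multiplier = lam.update(episode_cost=30.0)
```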
This list is automatically generated from the titles and abstracts of the papers on this site.