Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model
- URL: http://arxiv.org/abs/2401.10700v1
- Date: Fri, 19 Jan 2024 14:05:09 GMT
- Title: Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model
- Authors: Yinan Zheng, Jianxiong Li, Dongjie Yu, Yujie Yang, Shengbo Eben Li, Xianyuan Zhan, Jingjing Liu
- Abstract summary: We propose FISOR (FeasIbility-guided Safe Offline RL), which allows safety constraint adherence, reward maximization, and offline policy learning to be realized via three decoupled processes.
In FISOR, the optimal policy for the translated optimization problem can be derived in a special form of weighted behavior cloning.
We show that FISOR is the only method that can guarantee safety satisfaction in all tasks, while achieving top returns in most tasks.
- Score: 23.93820548551533
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Safe offline RL is a promising way to bypass risky online interactions
towards safe policy learning. Most existing methods only enforce soft
constraints, i.e., constraining safety violations in expectation below
predetermined thresholds. This can lead to potentially unsafe outcomes and is
thus unacceptable in safety-critical scenarios. An alternative is to enforce
the hard constraint of zero violation. However, this can be challenging in the
offline setting, as it needs to strike the right balance among three highly intricate
and correlated aspects: safety constraint satisfaction, reward maximization,
and behavior regularization imposed by offline datasets. Interestingly, we
discover that via reachability analysis of safe-control theory, the hard safety
constraint can be equivalently translated to identifying the largest feasible
region given the offline dataset. This seamlessly converts the original trilogy
problem to a feasibility-dependent objective, i.e., maximizing reward value
within the feasible region while minimizing safety risks in the infeasible
region. Inspired by these, we propose FISOR (FeasIbility-guided Safe Offline
RL), which allows safety constraint adherence, reward maximization, and offline
policy learning to be realized via three decoupled processes, while offering
strong safety performance and stability. In FISOR, the optimal policy for the
translated optimization problem can be derived in a special form of weighted
behavior cloning. Thus, we propose a novel energy-guided diffusion model that
does not require training a complicated time-dependent classifier to extract
the policy, greatly simplifying the training. We compare FISOR against
baselines on the DSRL benchmark for safe offline RL. Evaluation results show that
FISOR is the only method that can guarantee safety satisfaction in all tasks,
while achieving top returns in most tasks.
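Illustrative sketch (not from the paper): the feasibility-dependent objective described above can be read as weighting behavior-cloning targets differently inside and outside the estimated feasible region. The Python snippet below is a minimal, assumed realization of that idea; the exponential weighting, the sign convention of the feasibility value, and all names are assumptions made for illustration, not the authors' exact formulation.

```python
import numpy as np


def feasibility_guided_weights(feasibility_value, reward_advantage,
                               safety_value, temperature=1.0):
    """Illustrative feasibility-dependent behavior-cloning weights.

    Idea taken from the abstract: maximize reward value inside the largest
    feasible region and minimize safety risk outside it, realized as weighted
    behavior cloning over dataset actions. Sign conventions and the exponential
    form are assumptions, not FISOR's exact derivation.

    Args:
        feasibility_value: array; <= 0 is treated as feasible (hard constraint
            can be kept), > 0 as infeasible (assumed convention).
        reward_advantage: array of reward advantages for dataset actions.
        safety_value: array of safety-risk values to be minimized when infeasible.
        temperature: softness of the exponential weighting (assumed form).
    """
    feasible = feasibility_value <= 0.0
    # Feasible region: favor actions with high reward advantage.
    w_feasible = np.exp(reward_advantage / temperature)
    # Infeasible region: favor actions that reduce safety risk.
    w_infeasible = np.exp(-safety_value / temperature)
    weights = np.where(feasible, w_feasible, w_infeasible)
    # Clip for numerical stability; these weights would then scale a
    # behavior-cloning loss over dataset actions.
    return np.clip(weights, 0.0, 100.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fv = rng.normal(size=5)    # placeholder feasibility values
    adv = rng.normal(size=5)   # placeholder reward advantages
    risk = rng.normal(size=5)  # placeholder safety-risk values
    print(feasibility_guided_weights(fv, adv, risk))
```

Per the abstract, FISOR itself extracts the weighted-behavior-cloning policy with an energy-guided diffusion model rather than a plain regressor, avoiding a time-dependent classifier; the sketch above only illustrates how the weights could depend on feasibility.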
Related papers
- Reward-Safety Balance in Offline Safe RL via Diffusion Regularization [16.5825143820431]
Constrained reinforcement learning (RL) seeks high-performance policies under safety constraints.
We propose Diffusion-Regularized Constrained Offline Reinforcement Learning (DRCORL).
DRCORL first uses a diffusion model to capture the behavioral policy from offline data and then extracts a simplified policy to enable efficient inference.
arXiv Detail & Related papers (2025-02-18T00:00:03Z)
- FAWAC: Feasibility Informed Advantage Weighted Regression for Persistent Safety in Offline Reinforcement Learning [7.888219789657414]
Safe offline reinforcement learning aims to learn policies that maximize cumulative rewards while adhering to safety constraints.
A key challenge is balancing safety and performance, particularly when the policy encounters out-of-distribution states and actions.
We introduce Feasibility Informed Advantage Weighted Actor-Critic (FAWAC), a method that prioritizes persistent safety in constrained Markov decision processes.
arXiv Detail & Related papers (2024-12-12T02:28:50Z)
- Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning [7.888219789657414]
In safe offline reinforcement learning (RL), the objective is to develop a policy that maximizes cumulative rewards while strictly adhering to safety constraints.
We address these issues with a novel approach that begins by learning a conservatively safe policy through the use of Conditional Variational Autoencoders.
We frame this as a Constrained Reward-Return Maximization problem, wherein the policy aims to optimize rewards while complying with the inferred latent safety constraints.
arXiv Detail & Related papers (2024-12-11T22:00:07Z)
- Safety through Permissibility: Shield Construction for Fast and Safe Reinforcement Learning [57.84059344739159]
"Shielding" is a popular technique to enforce safety inReinforcement Learning (RL)
We propose a new permissibility-based framework to deal with safety and shield construction.
arXiv Detail & Related papers (2024-05-29T18:00:21Z)
- Uniformly Safe RL with Objective Suppression for Multi-Constraint Safety-Critical Applications [73.58451824894568]
The widely adopted CMDP model constrains the risks in expectation, which makes room for dangerous behaviors in long-tail states.
In safety-critical domains, such behaviors could lead to disastrous outcomes.
We propose Objective Suppression, a novel method that adaptively suppresses the task reward maximizing objectives according to a safety critic.
arXiv Detail & Related papers (2024-02-23T23:22:06Z)
- A Multiplicative Value Function for Safe and Efficient Reinforcement Learning [131.96501469927733]
We propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic.
The safety critic predicts the probability of constraint violation and discounts the reward critic that only estimates constraint-free returns (a minimal sketch of this multiplicative combination appears after this list).
We evaluate our method in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations.
arXiv Detail & Related papers (2023-03-07T18:29:15Z)
- Constrained Decision Transformer for Offline Safe Reinforcement Learning [16.485325576173427]
We study the offline safe RL problem from a novel multi-objective optimization perspective.
We propose the constrained decision transformer (CDT) approach, which can dynamically adjust the trade-offs during deployment.
arXiv Detail & Related papers (2023-02-14T21:27:10Z)
- SaFormer: A Conditional Sequence Modeling Approach to Offline Safe Reinforcement Learning [64.33956692265419]
Offline safe RL is of great practical relevance for deploying agents in real-world applications.
We present a novel offline safe RL approach referred to as SaFormer.
arXiv Detail & Related papers (2023-01-28T13:57:01Z)
- Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments [84.3830478851369]
We propose a safe reinforcement learning approach that can jointly learn the environment and optimize the control policy.
Our approach can effectively enforce hard safety constraints and significantly outperform CMDP-based baseline methods in system safe rate measured via simulations.
arXiv Detail & Related papers (2022-09-29T20:49:25Z)
- Safe Reinforcement Learning via Confidence-Based Filters [78.39359694273575]
We develop a control-theoretic approach for certifying state safety constraints for nominal policies learned via standard reinforcement learning techniques.
We provide formal safety guarantees, and empirically demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-07-04T11:43:23Z)
- Constraints Penalized Q-Learning for Safe Offline Reinforcement Learning [15.841609263723575]
We study the problem of safe offline reinforcement learning (RL).
The goal is to learn a policy that maximizes long-term reward while satisfying safety constraints given only offline data, without further interaction with the environment.
We show that naïve approaches that combine techniques from safe RL and offline RL can only learn sub-optimal solutions.
arXiv Detail & Related papers (2021-07-19T16:30:14Z)
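As referenced in the multiplicative value function entry above, the following toy Python snippet illustrates the described combination, where a safety critic's violation probability discounts a reward-only critic. The function and variable names are assumptions for illustration, not that paper's implementation.

```python
import numpy as np


def multiplicative_value(q_reward, p_violation):
    """Toy multiplicative value: a safety critic's estimated probability of
    constraint violation discounts a reward critic that only estimates
    constraint-free returns (assumed combination rule, for illustration)."""
    return (1.0 - p_violation) * q_reward


# Toy usage: a high-return but likely-unsafe action scores lower than a
# moderately rewarding, safe one.
print(multiplicative_value(np.array([10.0, 6.0]), np.array([0.8, 0.05])))
```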