OASIS: Conditional Distribution Shaping for Offline Safe Reinforcement Learning
- URL: http://arxiv.org/abs/2407.14653v1
- Date: Fri, 19 Jul 2024 20:15:00 GMT
- Title: OASIS: Conditional Distribution Shaping for Offline Safe Reinforcement Learning
- Authors: Yihang Yao, Zhepeng Cen, Wenhao Ding, Haohong Lin, Shiqi Liu, Tingnan Zhang, Wenhao Yu, Ding Zhao
- Abstract summary: Offline safe reinforcement learning (RL) aims to train a policy that satisfies constraints using a pre-collected dataset.
This paper introduces a new paradigm in offline safe RL designed to overcome these critical limitations.
Our approach ensures compliance with safety constraints through effective data utilization and regularization techniques.
- Score: 30.540598779743455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline safe reinforcement learning (RL) aims to train a policy that satisfies constraints using a pre-collected dataset. Most current methods struggle with the mismatch between imperfect demonstrations and the desired safe and rewarding performance. In this paper, we introduce OASIS (cOnditionAl diStributIon Shaping), a new paradigm in offline safe RL designed to overcome these critical limitations. OASIS utilizes a conditional diffusion model to synthesize offline datasets, thus shaping the data distribution toward a beneficial target domain. Our approach ensures compliance with safety constraints through effective data utilization and regularization techniques that benefit offline safe RL training. Comprehensive evaluations on public benchmarks and varying datasets show that OASIS enables offline safe RL agents to achieve high-reward behavior while satisfying safety constraints, outperforming established baselines. Furthermore, OASIS exhibits high data efficiency and robustness, making it suitable for real-world applications, particularly in tasks where safety is imperative and high-quality demonstrations are scarce.
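To make the conditioning idea concrete, the sketch below shows one plausible way a conditional diffusion model could synthesize transition-level data conditioned on target reward and cost returns. It is a minimal illustration only: the network architecture, noise schedule, transition encoding, and all names (CondDenoiser, train_step, synthesize) are assumptions for exposition, not the authors' implementation.

```python
# Hedged sketch: conditional DDPM-style synthesis of transition vectors,
# conditioned on (target reward-return, target cost-return). All sizes,
# schedules, and names here are illustrative assumptions.
import torch
import torch.nn as nn

T = 100                                         # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta_t)

class CondDenoiser(nn.Module):
    """Predicts the noise added to a transition, given timestep and condition."""
    def __init__(self, x_dim, cond_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, x_dim),
        )
    def forward(self, x_t, t, cond):
        t_emb = (t.float() / T).unsqueeze(-1)   # simple scalar timestep embedding
        return self.net(torch.cat([x_t, t_emb, cond], dim=-1))

def train_step(model, opt, x0, cond):
    """One DDPM-style training step on transitions x0 with (reward, cost) condition."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    ab = alphas_bar[t].unsqueeze(-1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise      # forward noising
    loss = ((model(x_t, t, cond) - noise) ** 2).mean()  # predict the injected noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def synthesize(model, n, x_dim, target_cond):
    """Ancestral sampling conditioned on a desired (high-reward, low-cost) target."""
    x = torch.randn(n, x_dim)
    cond = target_cond.expand(n, -1)
    for i in reversed(range(T)):
        t = torch.full((n,), i, dtype=torch.long)
        beta, ab = betas[i], alphas_bar[i]
        eps = model(x, t, cond)
        x = (x - beta / (1 - ab).sqrt() * eps) / (1 - beta).sqrt()
        if i > 0:
            x = x + beta.sqrt() * torch.randn_like(x)
    return x  # synthetic transitions shaped toward the target condition

# Usage sketch: fit on the offline dataset, then generate a shaped dataset.
x_dim = 12  # e.g. a flat encoding of (state, action, reward, cost, next_state); assumed
model = CondDenoiser(x_dim)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
for _ in range(10):  # tiny demo loop on random stand-in data
    train_step(model, opt, torch.randn(64, x_dim), torch.rand(64, 2))
shaped = synthesize(model, n=256, x_dim=x_dim, target_cond=torch.tensor([[1.0, 0.0]]))
```

A complete pipeline would then train an offline safe RL agent on the shaped dataset; this sketch covers only the data-synthesis step described in the abstract.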
Related papers
- Reward-Safety Balance in Offline Safe RL via Diffusion Regularization [16.5825143820431]
Constrained reinforcement learning (RL) seeks high-performance policies under safety constraints.
We propose Diffusion-Regularized Constrained Offline Reinforcement Learning (DRCORL)
DRCORL first uses a diffusion model to capture the behavioral policy from offline data and then extracts a simplified policy to enable efficient inference.
arXiv Detail & Related papers (2025-02-18T00:00:03Z) - Active Advantage-Aligned Online Reinforcement Learning with Offline Data [56.98480620108727]
A3 RL is a novel method that actively selects data from combined online and offline sources to optimize policy improvement.
We provide a theoretical guarantee that validates the effectiveness of our active sampling strategy.
arXiv Detail & Related papers (2025-02-11T20:31:59Z) - FAWAC: Feasibility Informed Advantage Weighted Regression for Persistent Safety in Offline Reinforcement Learning [7.888219789657414]
Safe offline reinforcement learning aims to learn policies that maximize cumulative rewards while adhering to safety constraints.
The key challenge is balancing safety and performance, particularly when the policy encounters out-of-distribution states and actions.
We introduce Feasibility Informed Advantage Weighted Actor-Critic (FAWAC), a method that prioritizes persistent safety in constrained Markov decision processes.
arXiv Detail & Related papers (2024-12-12T02:28:50Z) - FOSP: Fine-tuning Offline Safe Policy through World Models [3.7971075341023526]
Model-based Reinforcement Learning (RL) has shown high training efficiency and the capability to handle high-dimensional tasks.
However, prior works still pose safety challenges due to the online exploration in real-world deployment.
In this paper, we aim to further enhance safety during the deployment stage for vision-based robotic tasks by fine-tuning an offline-trained policy.
arXiv Detail & Related papers (2024-07-06T03:22:57Z) - Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model [23.93820548551533]
We propose FISOR (FeasIbility-guided Safe Offline RL), which addresses safety constraint adherence, reward maximization, and offline policy learning.
In FISOR, the optimal policy for the translated optimization problem can be derived in a special form of weighted behavior cloning.
We show that FISOR is the only method that can guarantee safety satisfaction in all tasks, while achieving top returns in most tasks.
arXiv Detail & Related papers (2024-01-19T14:05:09Z) - Guided Online Distillation: Promoting Safe Reinforcement Learning by Offline Demonstration [75.51109230296568]
We argue that extracting an expert policy from offline data to guide online exploration is a promising solution to mitigate the conservativeness issue.
We propose Guided Online Distillation (GOLD), an offline-to-online safe RL framework.
GOLD distills an offline DT policy into a lightweight policy network through guided online safe RL training, which outperforms both the offline DT policy and online safe RL algorithms.
arXiv Detail & Related papers (2023-09-18T00:22:59Z) - A Simple Unified Uncertainty-Guided Framework for Offline-to-Online Reinforcement Learning [25.123237633748193]
Offline-to-online reinforcement learning can be challenging due to constrained exploratory behavior and state-action distribution shift.
We propose a Simple Unified uNcertainty-Guided (SUNG) framework, which unifies the solution to both challenges with the tool of uncertainty.
SUNG achieves state-of-the-art online finetuning performance when combined with different offline RL methods.
arXiv Detail & Related papers (2023-06-13T05:22:26Z) - Online Safety Property Collection and Refinement for Safe Deep Reinforcement Learning in Mapless Navigation [79.89605349842569]
We introduce the Collection and Refinement of Online Properties (CROP) framework to design properties at training time.
CROP employs a cost signal to identify unsafe interactions and uses them to shape safety properties.
We evaluate our approach in several robotic mapless navigation tasks and demonstrate that the violation metric computed with CROP enables higher returns and lower violations than previous Safe DRL approaches.
arXiv Detail & Related papers (2023-02-13T21:19:36Z) - SaFormer: A Conditional Sequence Modeling Approach to Offline Safe Reinforcement Learning [64.33956692265419]
Offline safe RL is of great practical relevance for deploying agents in real-world applications.
We present a novel offline safe RL approach referred to as SaFormer.
arXiv Detail & Related papers (2023-01-28T13:57:01Z) - Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing constraint violations in safe reinforcement learning policy tasks.
arXiv Detail & Related papers (2022-07-21T11:14:47Z) - Constrained Policy Optimization via Bayesian World Models [79.0077602277004]
LAMBDA is a model-based approach for policy optimization in safety-critical tasks modeled via constrained Markov decision processes.
We demonstrate LAMBDA's state-of-the-art performance on the Safety-Gym benchmark suite in terms of sample efficiency and constraint violation.
arXiv Detail & Related papers (2022-01-24T17:02:22Z)