SaFormer: A Conditional Sequence Modeling Approach to Offline Safe
Reinforcement Learning
- URL: http://arxiv.org/abs/2301.12203v1
- Date: Sat, 28 Jan 2023 13:57:01 GMT
- Title: SaFormer: A Conditional Sequence Modeling Approach to Offline Safe
Reinforcement Learning
- Authors: Qin Zhang and Linrui Zhang and Haoran Xu and Li Shen and Bowen Wang
and Yongzhe Chang and Xueqian Wang and Bo Yuan and Dacheng Tao
- Abstract summary: Offline safe RL is of great practical relevance for deploying agents in real-world applications.
We present a novel offline safe RL approach referred to as SaFormer.
- Score: 64.33956692265419
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline safe RL is of great practical relevance for deploying agents in
real-world applications. However, acquiring constraint-satisfying policies from
the fixed dataset is non-trivial for conventional approaches. Even worse, the
learned constraints are stationary and may become invalid when the online
safety requirement changes. In this paper, we present a novel offline safe RL
approach referred to as SaFormer, which tackles the above issues via
conditional sequence modeling. In contrast to existing sequence models, we
propose cost-related tokens to restrict the action space and a posterior safety
verification to enforce the constraint explicitly. Specifically, SaFormer
performs a two-stage auto-regression conditioned on the maximum remaining cost
to generate feasible candidates. It then filters out unsafe attempts and
executes the optimal action with the highest expected return. Extensive
experiments demonstrate the efficacy of SaFormer featuring (1) competitive
returns with tightened constraint satisfaction; (2) adaptability to the
in-range cost values of the offline data without retraining; (3)
generalizability for constraints beyond the current dataset.
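A minimal sketch of the decode-time procedure described above is shown below, assuming a trained sequence model that exposes sample_action, predict_cost, and predict_return, and an environment whose step returns a per-step cost. These interface names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the decode-time loop: two-stage candidate generation conditioned
# on the remaining cost budget, followed by posterior safety verification.
# Model/env interfaces here are assumptions, not the authors' code.

def select_action(model, history, remaining_cost, n_candidates=16):
    """Generate cost-conditioned candidates, filter unsafe ones, pick the best."""
    # Stage 1: condition generation on the maximum remaining cost so that
    # sampled actions come from a restricted, budget-aware action space.
    candidates = [model.sample_action(history, cost_to_go=remaining_cost)
                  for _ in range(n_candidates)]

    # Posterior safety verification: discard candidates whose predicted cost
    # would exceed the remaining budget.
    feasible = [a for a in candidates
                if model.predict_cost(history, a) <= remaining_cost]
    if not feasible:
        # Fallback: no candidate is verified safe, so take the least costly one.
        return min(candidates, key=lambda a: model.predict_cost(history, a))

    # Execute the feasible candidate with the highest expected return.
    return max(feasible, key=lambda a: model.predict_return(history, a))


def rollout(env, model, cost_budget, max_steps=1000):
    """Episode loop that keeps the remaining cost budget up to date."""
    obs = env.reset()
    history, remaining = [obs], float(cost_budget)
    for _ in range(max_steps):
        action = select_action(model, history, remaining)
        obs, reward, cost, done = env.step(action)  # assumed safe-RL env API
        remaining = max(remaining - cost, 0.0)      # shrink the budget
        history += [action, reward, cost, obs]
        if done:
            break
    return remaining
```

Because the budget enters only as a decode-time conditioning value, the target cost can be changed at deployment without retraining, consistent with points (2) and (3) above.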
Related papers
- Learning General Continuous Constraint from Demonstrations via Positive-Unlabeled Learning [8.361428709513476]
This paper presents a positive-unlabeled (PU) learning approach to infer a continuous, arbitrary, and possibly nonlinear constraint from demonstrations.
The effectiveness of the proposed method is validated in two MuJoCo environments.
arXiv Detail & Related papers (2024-07-23T14:00:18Z)
- OASIS: Conditional Distribution Shaping for Offline Safe Reinforcement Learning [30.540598779743455]
Offline safe reinforcement learning (RL) aims to train a policy that satisfies constraints using a pre-collected dataset.
This paper introduces a new paradigm in offline safe RL designed to overcome the limitations of prior approaches.
Our approach ensures compliance with safety constraints through effective data utilization and regularization techniques.
arXiv Detail & Related papers (2024-07-19T20:15:00Z)
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based scenarios.
arXiv Detail & Related papers (2024-05-29T22:12:52Z)
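The reduction described here follows the generic Lagrangian dualization sketched below; the paper's specific closed-form dual for alignment objectives is not reproduced, so treat this as a sketch of the general construction only.

```latex
% Generic Lagrangian dualization of a constrained alignment problem
% (a sketch of the general construction, not the paper's exact objective):
%   maximize J_r(pi)  subject to  J_g(pi) >= b.
\[
  D(\lambda) \;=\; \max_{\pi}\ J_r(\pi) + \lambda\,\bigl(J_g(\pi) - b\bigr),
  \qquad
  \lambda^{\star} \;=\; \arg\min_{\lambda \ge 0}\ D(\lambda).
\]
% If D has a closed form, lambda* can be pre-computed ("one-shot"); the
% constrained problem then reduces to unconstrained alignment with the
% mixed objective J_r(pi) + lambda* J_g(pi).
```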
- Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model [23.93820548551533]
We propose FISOR (FeasIbility-guided Safe Offline RL), which decouples safety constraint adherence, reward maximization, and offline policy learning.
In FISOR, the optimal policy for the translated optimization problem can be derived in a special form of weighted behavior cloning.
We show that FISOR is the only method that can guarantee safety satisfaction in all tasks, while achieving top returns in most tasks.
arXiv Detail & Related papers (2024-01-19T14:05:09Z)
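The weighted behavior cloning form mentioned above can be sketched as a generic weighted log-likelihood objective; the weight function below is a placeholder for the paper's feasibility- and advantage-based weights, and the policy interface is an assumption.

```python
# Generic weighted behavior-cloning objective of the kind the FISOR summary
# refers to. `weight_fn` stands in for the paper's feasibility/advantage
# weights; the policy interface is an assumption.
import torch


def weighted_bc_loss(policy, states, actions, weight_fn):
    """Per-sample weighted negative log-likelihood of dataset actions."""
    log_prob = policy.log_prob(states, actions)   # assumed policy interface
    with torch.no_grad():
        weights = weight_fn(states, actions)      # e.g. feasibility x exp(advantage)
    return -(weights * log_prob).mean()
```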
- Constraint-Conditioned Policy Optimization for Versatile Safe Reinforcement Learning [33.988698754176646]
We introduce the Constraint-Conditioned Policy Optimization (CCPO) framework, consisting of two key modules.
Our experiments demonstrate that CCPO outperforms the baselines in terms of safety and task performance.
This makes our approach suitable for real-world dynamic applications.
arXiv Detail & Related papers (2023-10-05T17:39:02Z)
- Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states and more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z)
- Constrained Decision Transformer for Offline Safe Reinforcement Learning [16.485325576173427]
We study the offline safe RL problem from a novel multi-objective optimization perspective.
We propose the constrained decision transformer (CDT) approach, which can dynamically adjust the trade-offs during deployment.
arXiv Detail & Related papers (2023-02-14T21:27:10Z)
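As with SaFormer above, the deployment-time trade-off adjustment comes from conditioning the sequence model on target tokens; the sketch below illustrates the idea with an assumed model interface, not the CDT codebase.

```python
# Illustrative decode step for a return- and cost-conditioned sequence model,
# showing how deployment-time trade-offs can be changed without retraining.
# Token names and the model interface are assumptions.

def act(model, history, target_return, target_cost):
    """Pick an action conditioned on the desired return/cost targets."""
    return model.sample_action(history,
                               return_to_go=target_return,
                               cost_to_go=target_cost)

# Tightening or loosening safety at deployment is just a change of tokens:
#   act(model, history, target_return=300.0, target_cost=10.0)  # stricter budget
#   act(model, history, target_return=400.0, target_cost=40.0)  # looser budget
```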
- Enhancing Safe Exploration Using Safety State Augmentation [71.00929878212382]
We tackle the problem of safe exploration in model-free reinforcement learning.
We derive policies for scheduling the safety budget during training.
We show that the resulting approach, Simmer, can stabilize training and improve the performance of safe RL with average constraints.
arXiv Detail & Related papers (2022-06-06T15:23:07Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z)
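A heavily simplified version of the kind of Lagrangian such DICE-style methods optimize is sketched below; the paper's exact objective, f-divergence regularizer, and Bellman-flow constraints are not reproduced, and all names are assumptions.

```python
# Simplified Lagrangian for DICE-style offline constrained RL, loosely
# following the summary above: maximize return under the corrected
# distribution, keep expected cost under the limit, and regularize the
# correction ratios. An illustration, not the paper's exact objective.
import torch


def coptidice_like_lagrangian(w, rewards, costs, cost_limit, lam, alpha=1.0):
    """w: stationary distribution correction ratios d_pi / d_D on a data batch."""
    ret_term = (w * rewards).mean()               # E_dD[w * r] = return under d_pi
    cost_gap = (w * costs).mean() - cost_limit    # constraint: E_dD[w * c] <= limit
    reg = alpha * ((w - 1.0) ** 2).mean()         # chi^2-style divergence penalty
    # Maximize over w, minimize over lam >= 0.
    return ret_term - lam * cost_gap - reg
```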
- Continuous Doubly Constrained Batch Reinforcement Learning [93.23842221189658]
We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment.
The limited data in batch RL produces inherent uncertainty in value estimates for states and actions that are insufficiently represented in the training data.
We propose to mitigate this issue via two straightforward penalties: a policy-constraint penalty that reduces divergence from the behavior policy and a value-constraint penalty that discourages overly optimistic estimates.
arXiv Detail & Related papers (2021-02-18T08:54:14Z)
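The two penalties can be sketched generically as follows; the penalty forms, coefficients, and interfaces are illustrative assumptions rather than the paper's exact formulation.

```python
# Generic versions of the two penalties described above: a policy constraint
# that keeps the learned policy close to the behavior policy, and a value
# constraint that discourages optimistic estimates for poorly covered actions.
import torch


def actor_loss(q_values, kl_to_behavior, policy_coef=1.0):
    # Maximize Q while penalizing divergence from the behavior policy.
    return (-q_values + policy_coef * kl_to_behavior).mean()


def critic_loss(q_pred, td_target, q_ood_actions, value_coef=1.0):
    # Standard TD regression plus a penalty on Q-values of actions sampled
    # outside the dataset, discouraging overly optimistic value estimates.
    td_error = (q_pred - td_target).pow(2).mean()
    return td_error + value_coef * q_ood_actions.mean()
```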
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.