Conditional Sequence Modeling for Safe Reinforcement Learning
- URL: http://arxiv.org/abs/2602.08584v1
- Date: Mon, 09 Feb 2026 12:22:57 GMT
- Title: Conditional Sequence Modeling for Safe Reinforcement Learning
- Authors: Wensong Bai, Chao Zhang, Qihang Xu, Chufan Chen, Chenhao Zhou, Hui Qian
- Abstract summary: Offline safe reinforcement learning aims to learn policies from a fixed dataset while maximizing performance under cumulative cost constraints. Most existing offline safe RL methods are trained under a pre-specified threshold. We propose RCDT, a CSM-based method that supports zero-shot deployment across multiple cost thresholds within a single trained policy.
- Score: 8.858563919623082
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline safe reinforcement learning (RL) aims to learn policies from a fixed dataset while maximizing performance under cumulative cost constraints. In practice, deployment requirements often vary across scenarios, necessitating a single policy that can adapt zero-shot to different cost thresholds. However, most existing offline safe RL methods are trained under a pre-specified threshold, yielding policies with limited generalization and deployment flexibility across cost thresholds. Motivated by recent progress in conditional sequence modeling (CSM), which enables flexible goal-conditioned control by specifying target returns, we propose RCDT, a CSM-based method that supports zero-shot deployment across multiple cost thresholds within a single trained policy. RCDT is the first CSM-based offline safe RL algorithm that integrates a Lagrangian-style cost penalty with an auto-adaptive penalty coefficient. To avoid overly conservative behavior and achieve a more favorable return-cost trade-off, a reward-cost-aware trajectory reweighting mechanism and Q-value regularization are further incorporated. Extensive experiments on the DSRL benchmark demonstrate that RCDT consistently improves return-cost trade-offs over representative baselines, advancing the state-of-the-art in offline safe RL.
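The abstract mentions two generic ingredients that a short example can make concrete: conditioning on target returns and costs, and a Lagrangian-style penalty whose coefficient adapts automatically. The sketch below is not RCDT's implementation. It only illustrates, under assumptions, the standard return-to-go and cost-to-go suffix sums used in conditional sequence modeling and a textbook projected dual-ascent update for a penalty coefficient. The names `to_go` and `update_multiplier`, the learning rate, and the toy trajectory are all hypothetical.

```python
from typing import List, Sequence


def to_go(values: Sequence[float]) -> List[float]:
    """Suffix sums: entry t is the sum of values[t:], the amount still to come."""
    out = [0.0] * len(values)
    running = 0.0
    for t in range(len(values) - 1, -1, -1):
        running += values[t]
        out[t] = running
    return out


def update_multiplier(lam: float, episode_cost: float,
                      threshold: float, lr: float = 0.1) -> float:
    """One projected dual-ascent step (illustrative, not the paper's rule).

    The coefficient grows when the cost exceeds the threshold, shrinks
    otherwise, and is clipped at zero so it stays a valid penalty weight.
    """
    return max(0.0, lam + lr * (episode_cost - threshold))


# Toy trajectory (made-up numbers, for illustration only).
rewards = [1.0, 0.5, 2.0]
costs = [0.0, 1.0, 1.0]

rtg = to_go(rewards)  # return-to-go tokens a sequence model could condition on
ctg = to_go(costs)    # cost-to-go tokens, the analogous conditioning signal for cost

# The total episode cost is ctg[0]; here it is compared against a threshold of 1.0.
lam = update_multiplier(0.0, episode_cost=ctg[0], threshold=1.0)
```

In a scheme of this kind, the cost-to-go value supplied at deployment time would play the role of the cost threshold, which is how a single trained model could in principle be queried under different budgets.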
Related papers
- Latent Policy Steering through One-Step Flow Policies [34.06099184809882]
Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Latent Policy Steering (LPS) enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.
arXiv Detail & Related papers (2026-03-05T15:38:08Z)
- Boundary-to-Region Supervision for Offline Safe Reinforcement Learning [56.150983204962735]
Boundary-to-Region (B2R) is a framework that enables asymmetric conditioning through cost signal realignment. B2R redefines CTG as a boundary constraint under a fixed safety budget, unifying the cost distribution of all feasible trajectories. Experimental results show that B2R satisfies safety constraints in 35 out of 38 safety-critical tasks.
arXiv Detail & Related papers (2025-09-30T03:38:20Z)
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Imitate the Good and Avoid the Bad: An Incremental Approach to Safe Reinforcement Learning [11.666700714916065]
Constrained RL is a framework for enforcing safe actions in Reinforcement Learning.
Most recent approaches for solving Constrained RL convert the trajectory-based cost constraint into a surrogate problem.
We present an approach that does not modify the trajectory-based cost constraint and instead imitates "good" trajectories.
arXiv Detail & Related papers (2023-12-16T08:48:46Z)
- Handling Long and Richly Constrained Tasks through Constrained Hierarchical Reinforcement Learning [20.280636126917614]
Safety in goal directed Reinforcement Learning (RL) settings has typically been handled through constraints over trajectories.
We propose a (safety) Constrained Search with Hierarchical Reinforcement Learning (CoSHRL) mechanism that combines an upper level constrained search agent with a low-level goal conditioned RL agent.
A major advantage of CoSHRL is that it can handle constraints on the cost value distribution and can adjust to flexible constraint thresholds without retraining.
arXiv Detail & Related papers (2023-02-21T12:57:12Z)
- SaFormer: A Conditional Sequence Modeling Approach to Offline Safe Reinforcement Learning [64.33956692265419]
Offline safe RL is of great practical relevance for deploying agents in real-world applications.
We present a novel offline safe RL approach referred to as SaFormer.
arXiv Detail & Related papers (2023-01-28T13:57:01Z)
- BCRLSP: An Offline Reinforcement Learning Framework for Sequential Targeted Promotion [8.499811428928071]
We propose the Budget Constrained Reinforcement Learning for Sequential Promotion framework to determine the value of cash bonuses to be sent to users.
We show that BCRLSP achieves a higher long-term customer retention rate and a lower cost than various baselines.
arXiv Detail & Related papers (2022-07-16T00:10:12Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of the stationary distribution.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
- Guided Constrained Policy Optimization for Dynamic Quadrupedal Robot Locomotion [78.46388769788405]
We introduce guided constrained policy optimization (GCPO), an RL framework based upon our implementation of constrained proximal policy optimization (CPPO).
We show that guided constrained RL offers faster convergence close to the desired optimum, resulting in an optimal yet physically feasible robotic control behavior without the need for precise reward function tuning.
arXiv Detail & Related papers (2020-02-22T10:15:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.