OSIL: Learning Offline Safe Imitation Policies with Safety Inferred from Non-preferred Trajectories
- URL: http://arxiv.org/abs/2602.11018v1
- Date: Wed, 11 Feb 2026 16:41:16 GMT
- Title: OSIL: Learning Offline Safe Imitation Policies with Safety Inferred from Non-preferred Trajectories
- Authors: Returaj Burnwal, Nirav Pravinbhai Bhatt, Balaraman Ravindran
- Abstract summary: This work addresses the problem of offline safe imitation learning (IL). The goal is to learn safe and reward-maximizing policies from demonstrations that do not have per-timestep safety cost or reward information. We propose a novel offline safe IL algorithm, OSIL, that infers safety from non-preferred demonstrations.
- Score: 5.52395321369933
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work addresses the problem of offline safe imitation learning (IL), where the goal is to learn safe and reward-maximizing policies from demonstrations that do not have per-timestep safety cost or reward information. In many real-world domains, online learning in the environment can be risky, and specifying accurate safety costs can be difficult. However, it is often feasible to collect trajectories that reflect undesirable or unsafe behavior, implicitly conveying what the agent should avoid. We refer to these as non-preferred trajectories. We propose a novel offline safe IL algorithm, OSIL, that infers safety from non-preferred demonstrations. We formulate safe policy learning as a Constrained Markov Decision Process (CMDP). Instead of relying on explicit safety cost and reward annotations, OSIL reformulates the CMDP problem by deriving a lower bound on the reward-maximizing objective and learning a cost model that estimates the likelihood of non-preferred behavior. Our approach allows agents to learn safe and reward-maximizing behavior entirely from offline demonstrations. We empirically demonstrate that our approach learns safer policies that satisfy cost constraints without degrading reward performance, thereby outperforming several baselines.
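The abstract describes two ingredients: a cost model that estimates how likely behavior is to be non-preferred, and a reformulated CMDP objective. The paper's own derivation (the lower bound on the reward objective) is not reproduced here; the following is a minimal sketch under the assumption of a discriminator-style cost model and a standard Lagrangian relaxation, with all names (CostModel, lagrangian_policy_loss, dual_update) being illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative only: a cost model trained to score how likely a state-action
# pair is to come from the non-preferred dataset, used as a surrogate safety
# cost inside a Lagrangian relaxation of the CMDP objective.

class CostModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # Logit of "this state-action pair looks non-preferred".
        return self.net(torch.cat([obs, act], dim=-1))

def cost_model_loss(cost_model, pref_obs, pref_act, nonpref_obs, nonpref_act):
    """Binary cross-entropy: preferred pairs labeled 0, non-preferred pairs 1."""
    bce = nn.functional.binary_cross_entropy_with_logits
    logits_p = cost_model(pref_obs, pref_act)
    logits_n = cost_model(nonpref_obs, nonpref_act)
    return bce(logits_p, torch.zeros_like(logits_p)) + \
           bce(logits_n, torch.ones_like(logits_n))

def lagrangian_policy_loss(imitation_loss, expected_cost, lam, cost_limit):
    """Imitation (reward surrogate) objective penalized by constraint violation."""
    return imitation_loss + lam * (expected_cost - cost_limit)

def dual_update(lam, expected_cost, cost_limit, lr=1e-3):
    """Dual ascent on the multiplier: grow it while the cost budget is exceeded."""
    return max(0.0, lam + lr * (expected_cost - cost_limit))
```

In this reading, the cost model supplies the per-step safety signal that the offline data lacks, and the multiplier is adapted by dual ascent whenever the estimated cost exceeds the budget.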
Related papers
- Offline Safe Policy Optimization From Heterogeneous Feedback [35.454656807434006]
We introduce a framework that learns a policy based on pairwise preferences regarding the agent's behavior in terms of rewards, as well as binary labels indicating the safety of trajectory segments.
Our method successfully learns safe policies with high rewards, outperforming state-of-the-art baselines.
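As a rough illustration of such heterogeneous feedback, the sketch below fits a reward model to pairwise preferences with a Bradley-Terry likelihood and a separate cost model to binary segment-level safety labels; the function names, tensor shapes, and model interfaces are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative sketch: reward_model and cost_model are assumed to map a batch
# of segments of shape (batch, T, obs_dim + act_dim) to per-step scores of
# shape (batch, T).

def preference_loss(reward_model, seg_a, seg_b, a_preferred):
    """Bradley-Terry loss on pairwise preferences; a_preferred is 1 if A wins."""
    return_a = reward_model(seg_a).sum(dim=1)   # predicted return of segment A
    return_b = reward_model(seg_b).sum(dim=1)   # predicted return of segment B
    logits = return_a - return_b
    return nn.functional.binary_cross_entropy_with_logits(
        logits, a_preferred.float())

def safety_label_loss(cost_model, segments, is_unsafe):
    """Binary classification of segments from 0/1 safety labels."""
    logits = cost_model(segments).mean(dim=1)   # segment-level risk score
    return nn.functional.binary_cross_entropy_with_logits(
        logits, is_unsafe.float())
```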
arXiv Detail & Related papers (2025-12-23T09:07:53Z) - SafeMIL: Learning Offline Safe Imitation Policy from Non-Preferred Trajectories [5.52395321369933]
We study the problem of offline safe imitation learning (IL).
In this paper, we propose a novel approach, SafeMIL, to learn a parameterized cost that predicts if the state-action pair is risky.
The learned cost is then used to avoid non-preferred behaviors, resulting in a policy that prioritizes safety.
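The summary suggests a multiple-instance-learning flavor: labels exist only at the trajectory level (non-preferred or not), while the cost must score individual state-action pairs. A hedged sketch of that idea follows, with max-pooling as one plausible way to aggregate per-step risk into a trajectory-level prediction; the aggregation choice and interfaces are assumptions, not necessarily the paper's.

```python
import torch
import torch.nn as nn

# Illustrative sketch: cost_model is assumed to map (batch, T, obs_dim) and
# (batch, T, act_dim) to per-step risk logits of shape (batch, T, 1).

def mil_cost_loss(cost_model, obs, act, traj_is_nonpreferred):
    """Train per-step risk from trajectory-level labels only."""
    step_logits = cost_model(obs, act).squeeze(-1)   # (batch, T)
    bag_logits = step_logits.max(dim=1).values       # a trajectory is risky if any step is
    return nn.functional.binary_cross_entropy_with_logits(
        bag_logits, traj_is_nonpreferred.float())
```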
arXiv Detail & Related papers (2025-11-11T11:44:20Z) - Probabilistic Shielding for Safe Reinforcement Learning [51.35559820893218]
In real-life scenarios, a Reinforcement Learning (RL) agent must often also behave in a safe manner, including at training time.
We present a new, scalable method, which enjoys strict formal guarantees for Safe RL.
We show that our approach provides a strict formal safety guarantee that the agent stays safe at training and test time.
arXiv Detail & Related papers (2025-03-09T17:54:33Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
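A small, hypothetical data-construction sketch of the two components as described above; the field names, the character-level prefixing, and the supervised rendering of RTO are assumptions rather than the paper's actual training recipe.

```python
# Illustrative only: build training examples in which a safe refusal is the
# target continuation of a prompt followed by a (partial) harmful response.

def build_prefix_example(prompt, harmful_response, safe_refusal, k):
    """MLE with Harmful Response Prefix: continue a partially harmful reply
    with a safe refusal (k controls how much of the harmful reply is kept)."""
    return {"input": prompt + harmful_response[:k], "target": safe_refusal}

def build_transition_examples(prompt, harmful_response, safe_refusal, step):
    """Sketch of the transition idea behind RTO rendered as supervised data:
    encourage a switch to refusal at every position of the harmful response."""
    return [
        build_prefix_example(prompt, harmful_response, safe_refusal, k)
        for k in range(0, len(harmful_response), step)
    ]
```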
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - Safe Reinforcement Learning with Learned Non-Markovian Safety Constraints [15.904640266226023]
We design a safety model that performs credit assignment to assess contributions of partial state-action trajectories on safety.
We derive an effective algorithm for optimizing a safe policy using the learned safety model.
We devise a method to dynamically adapt the tradeoff coefficient between safety reward and safety compliance.
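One generic way to realize the last point is a dual-ascent style update of the tradeoff coefficient; the sketch below is illustrative only, and the violation budget, step size, and scalarized objective are assumptions rather than the paper's formulation.

```python
# Illustrative sketch of dynamically adapting a tradeoff coefficient.

def update_tradeoff_coefficient(lam, violation_rate, allowed_rate, lr=1e-3):
    """Raise the coefficient when safety violations exceed the budget,
    lower it (never below zero) when the policy is comfortably compliant."""
    return max(0.0, lam + lr * (violation_rate - allowed_rate))

def scalarized_objective(task_reward, safety_reward, lam):
    """Trade off the task objective against the safety signal."""
    return task_reward + lam * safety_reward
```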
arXiv Detail & Related papers (2024-05-05T17:27:22Z) - Guided Online Distillation: Promoting Safe Reinforcement Learning by Offline Demonstration [75.51109230296568]
We argue that extracting an expert policy from offline data to guide online exploration is a promising solution to mitigate the conservativeness issue.
We propose Guided Online Distillation (GOLD), an offline-to-online safe RL framework.
GOLD distills an offline DT policy into a lightweight policy network through guided online safe RL training, which outperforms both the offline DT policy and online safe RL algorithms.
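A compact, illustrative sketch of the distillation coupling described above: a frozen offline Decision-Transformer-style teacher guides a lightweight student that is trained jointly with an online safe-RL loss. The loss form, the MSE guidance term, and the weight beta are assumptions, not GOLD's actual objective.

```python
import torch
import torch.nn as nn

def gold_style_loss(student, teacher, obs, safe_rl_loss, beta=1.0):
    """safe_rl_loss is a placeholder for any constrained online RL objective
    already computed for the student; the teacher only provides guidance."""
    with torch.no_grad():
        teacher_action = teacher(obs)          # action suggested by the offline policy
    student_action = student(obs)
    distill = nn.functional.mse_loss(student_action, teacher_action)
    return safe_rl_loss + beta * distill
```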
arXiv Detail & Related papers (2023-09-18T00:22:59Z) - Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments [84.3830478851369]
We propose a safe reinforcement learning approach that can jointly learn the environment and optimize the control policy.
Our approach effectively enforces hard safety constraints and significantly outperforms CMDP-based baseline methods in terms of system safety rate, as measured in simulations.
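The summary does not spell out the barrier construction; the sketch below shows one common "soft barrier" shape, a log-barrier smoothly extended past the constraint boundary so it stays finite and differentiable everywhere, purely as an illustration of how hard constraints can be pushed through an unconstrained loss. The specific penalty form is an assumption.

```python
import torch

def soft_barrier_penalty(h, eps=1e-2):
    """h is a safety margin (h >= 0 means safe). For h > eps the penalty is a
    log-barrier; for h <= eps (including violations) it continues linearly
    with matching value and slope, so it stays finite and differentiable."""
    barrier = -torch.log(torch.clamp(h, min=eps))
    linear_tail = (1.0 - torch.log(torch.tensor(eps))) - h / eps
    return torch.where(h > eps, barrier, linear_tail)
```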
arXiv Detail & Related papers (2022-09-29T20:49:25Z) - SAFER: Data-Efficient and Safe Reinforcement Learning via Skill Acquisition [59.94644674087599]
We propose SAFEty skill pRiors (SAFER), an algorithm that accelerates policy learning on complex control tasks under safety constraints.
Through principled training on an offline dataset, SAFER learns to extract safe primitive skills.
In the inference stage, policies trained with SAFER learn to compose safe skills into successful policies.
arXiv Detail & Related papers (2022-02-10T05:43:41Z) - Learning Barrier Certificates: Towards Safe Reinforcement Learning with Zero Training-time Violations [64.39401322671803]
This paper explores the possibility of safe RL algorithms with zero training-time safety violations.
We propose an algorithm, Co-trained Barrier Certificate for Safe RL (CRABS), which iteratively learns barrier certificates, dynamics models, and policies.
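As an illustration of what learning a barrier certificate usually involves, the sketch below encodes the standard certificate conditions (nonnegative on initial states, negative on unsafe states, forward-invariant under the learned dynamics and policy) as penalty terms. This is a generic formulation, not CRABS itself; the margins, sampling, and one-step invariance check are assumptions.

```python
import torch

def barrier_certificate_loss(h, dynamics, policy, s_init, s_unsafe, s_sampled,
                             margin=0.1):
    """h, dynamics, and policy are callables on batches of states."""
    init_loss = torch.relu(margin - h(s_init)).mean()       # h >= margin on initial states
    unsafe_loss = torch.relu(h(s_unsafe) + margin).mean()   # h <= -margin on unsafe states
    s_next = dynamics(s_sampled, policy(s_sampled))         # one-step rollout of the model
    inside = (h(s_sampled) >= 0).float()                    # only states currently certified safe
    invariance_loss = (inside * torch.relu(margin - h(s_next))).mean()
    return init_loss + unsafe_loss + invariance_loss
```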
arXiv Detail & Related papers (2021-08-04T04:59:05Z) - Safe Reinforcement Learning via Projection on a Safe Set: How to Achieve Optimality? [0.0]
Reinforcement Learning (RL) still struggles to deliver formal guarantees on the closed-loop behavior of the learned policy.
Recent contributions propose projecting the inputs delivered by the learned policy onto a safe set.
This paper addresses this issue in the context of $Q$-learning and policy gradient techniques.
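A minimal sketch of the projection idea in the simplest possible setting, where the safe set is an axis-aligned box on the control inputs; realistic instantiations typically solve a small QP against the actual constraints. All names here are illustrative.

```python
import numpy as np

def project_to_box(action, low, high):
    """Euclidean projection of a proposed action onto the box [low, high]."""
    return np.clip(action, low, high)

def safe_step(env, policy, obs, low, high):
    """Apply the learned policy, but only after projecting its output
    onto the safe set."""
    proposed = policy(obs)                    # raw action from the learned policy
    return env.step(project_to_box(proposed, low, high))
```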
arXiv Detail & Related papers (2020-04-02T10:11:30Z)