Online Optimization for Offline Safe Reinforcement Learning
- URL: http://arxiv.org/abs/2510.22027v1
- Date: Fri, 24 Oct 2025 21:12:47 GMT
- Title: Online Optimization for Offline Safe Reinforcement Learning
- Authors: Yassine Chemingui, Aryan Deshwal, Alan Fern, Thanh Nguyen-Tang, Janardhan Rao Doppa
- Abstract summary: We study the problem of Offline Safe Reinforcement Learning (OSRL). The goal is to learn a reward-maximizing policy from fixed data under a cumulative cost constraint. We propose a novel OSRL approach that frames the problem as a minimax objective and solves it by combining offline RL with online optimization algorithms.
- Score: 44.48700237186216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of Offline Safe Reinforcement Learning (OSRL), where the goal is to learn a reward-maximizing policy from fixed data under a cumulative cost constraint. We propose a novel OSRL approach that frames the problem as a minimax objective and solves it by combining offline RL with online optimization algorithms. We prove the approximate optimality of this approach when integrated with an approximate offline RL oracle and no-regret online optimization. We also present a practical approximation that can be combined with any offline RL algorithm, eliminating the need for offline policy evaluation. Empirical results on the DSRL benchmark demonstrate that our method reliably enforces safety constraints under stringent cost budgets, while achieving high rewards. The code is available at https://github.com/yassineCh/O3SRL.
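The abstract's pairing of an approximate offline RL oracle with no-regret online optimization suggests a primal-dual loop: fix a Lagrange multiplier on the cost constraint, call the oracle on the cost-penalized reward, then update the multiplier from the estimated constraint violation. The sketch below is a minimal illustration under those assumptions; the interface (offline_rl_oracle, estimate_cost), the projected-gradient dual update, and all hyperparameters are placeholders, not the authors' implementation.

```python
import numpy as np

def primal_dual_osrl(offline_rl_oracle, estimate_cost, budget,
                     num_rounds=50, step_size=0.1, lam_max=100.0):
    """Sketch of a minimax OSRL loop (hypothetical interface).

    offline_rl_oracle(lam) -> policy trained offline on reward r - lam * c
    estimate_cost(policy)  -> estimated cumulative cost of the policy
    """
    lam = 0.0
    policies = []
    for _ in range(num_rounds):
        # Primal step: approximate best response to the penalized reward.
        policy = offline_rl_oracle(lam)
        policies.append(policy)
        # Dual step: no-regret (projected gradient) update on the multiplier,
        # driven by the estimated constraint violation.
        violation = estimate_cost(policy) - budget
        lam = float(np.clip(lam + step_size * violation, 0.0, lam_max))
    # No-regret analyses typically certify a mixture of iterates,
    # so all policies are returned rather than only the last one.
    return policies
```

Note that this sketch still relies on offline policy evaluation through estimate_cost; the practical approximation mentioned in the abstract is said to eliminate that need, so the authors' actual algorithm necessarily differs on this point.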
Related papers
- Offline Safe Policy Optimization From Heterogeneous Feedback [35.454656807434006]
We introduce a framework that learns a policy based on pairwise preferences regarding the agent's behavior in terms of rewards, as well as binary labels indicating the safety of trajectory segments. Our method successfully learns safe policies with high rewards, outperforming state-of-the-art baselines.
arXiv Detail & Related papers (2025-12-23T09:07:53Z)
- MOORL: A Framework for Integrating Offline-Online Reinforcement Learning [6.7265073544042995]
We propose Meta Offline-Online Reinforcement Learning (MOORL), a hybrid framework that unifies offline and online learning. Our theoretical analysis demonstrates that the hybrid approach enhances exploration by effectively combining the complementary strengths of offline and online data. With minimal computational overhead, MOORL achieves strong performance, underscoring its potential for practical applications in real-world scenarios.
arXiv Detail & Related papers (2025-06-11T10:12:50Z)
- Active Advantage-Aligned Online Reinforcement Learning with Offline Data [56.98480620108727]
We introduce A3RL, which incorporates a novel confidence-aware Active Advantage-Aligned sampling strategy. We demonstrate that our method outperforms competing online RL techniques that leverage offline data.
arXiv Detail & Related papers (2025-02-11T20:31:59Z)
- On the Statistical Complexity for Offline and Low-Adaptive Reinforcement Learning with Structures [63.36095790552758]
This article reviews recent advances on the statistical foundations of reinforcement learning (RL) in the offline and low-adaptive settings. We will start by arguing why offline RL is the appropriate model for almost any real-life ML problem, even one that has nothing to do with the recent AI breakthroughs that use RL. We will then zoom into two fundamental problems of offline RL: offline policy evaluation (OPE) and offline policy learning (OPL).
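For orientation, the two problems have standard formal statements; the notation below is the conventional one and is an assumption, not necessarily the article's:

```latex
% Offline data collected by an unknown behavior policy \mu:
%   D = \{(s_i, a_i, r_i, s_i')\}_{i=1}^{n}, \qquad a_i \sim \mu(\cdot \mid s_i).
% OPE: estimate the value of a fixed target policy \pi from D alone:
\mathrm{OPE}:\quad \widehat{J}(\pi) \approx J(\pi) := \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t\Big].
% OPL: return a near-optimal policy using only D:
\mathrm{OPL}:\quad \text{find } \widehat{\pi} \text{ such that } J(\pi^{\star}) - J(\widehat{\pi}) \text{ is small}.
```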
arXiv Detail & Related papers (2025-01-03T20:27:53Z)
- Bayesian Design Principles for Offline-to-Online Reinforcement Learning [50.97583504192167]
Offline-to-online fine-tuning is crucial for real-world applications where exploration can be costly or unsafe.
In this paper, we tackle the dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, while if it becomes optimistic directly, performance may suffer from a sudden drop.
We show that Bayesian design principles are crucial in solving such a dilemma.
arXiv Detail & Related papers (2024-05-31T16:31:07Z)
- Semi-Offline Reinforcement Learning for Optimized Text Generation [35.1606951874979]
In reinforcement learning (RL), there are two major settings for interacting with the environment: online and offline.
Online methods explore the environment at significant time cost, and offline methods efficiently obtain reward signals by sacrificing exploration capability.
We propose semi-offline RL, a novel paradigm that smoothly transitions from offline to online settings, balances exploration capability and training cost, and provides a theoretical foundation for comparing different RL settings.
arXiv Detail & Related papers (2023-06-16T09:24:29Z)
- Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning [66.43003402281659]
A central question boils down to how to efficiently utilize online data collection to strengthen and complement the offline dataset.
We design a three-stage hybrid RL algorithm that beats the best of both worlds -- pure offline RL and pure online RL.
The proposed algorithm does not require any reward information during data collection.
arXiv Detail & Related papers (2023-05-17T15:17:23Z)
- Behavior Proximal Policy Optimization [14.701955559885615]
Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly.
Online on-policy algorithms are naturally able to solve offline RL.
We propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without any extra constraint or regularization.
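The method's name suggests a PPO-style clipped surrogate in which the behavior policy plays the role of the previous policy; the objective below is an inference from the name and summary, labeled as such, not the paper's stated formulation:

```latex
% Hypothetical reading: clipped surrogate anchored to the behavior policy \beta,
% with ratio and advantage estimated from the offline dataset D.
L(\theta) = \mathbb{E}_{(s,a)\sim D}\Big[\min\Big(\rho_{\theta}(s,a)\, A^{\beta}(s,a),\;
  \mathrm{clip}\big(\rho_{\theta}(s,a),\, 1-\epsilon,\, 1+\epsilon\big)\, A^{\beta}(s,a)\Big)\Big],
\qquad \rho_{\theta}(s,a) = \frac{\pi_{\theta}(a \mid s)}{\beta(a \mid s)}.
```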
arXiv Detail & Related papers (2023-02-22T11:49:12Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We study the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
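In DICE-style notation, the summary points to a constrained program of roughly the following shape; the symbols are conventional assumptions rather than the paper's own:

```latex
% w(s,a) \approx d^{\pi}(s,a) / d^{D}(s,a): stationary distribution correction of
% the policy's occupancy d^{\pi} relative to the dataset occupancy d^{D}.
\max_{w \ge 0}\; \mathbb{E}_{(s,a)\sim d^{D}}\big[w(s,a)\, r(s,a)\big]
\quad \text{s.t.}\quad \mathbb{E}_{(s,a)\sim d^{D}}\big[w(s,a)\, c(s,a)\big] \le \hat{c},
```

together with Bellman flow constraints ensuring that w times d^D is a valid occupancy measure; bounding the cost term from above, as the summary describes, would then bias the solution toward cost-conservative policies.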
arXiv Detail & Related papers (2022-04-19T15:55:47Z) - Constraints Penalized Q-Learning for Safe Offline Reinforcement Learning [15.841609263723575]
We study the problem of safe offline reinforcement learning (RL).
The goal is to learn a policy that maximizes long-term reward while satisfying safety constraints given only offline data, without further interaction with the environment.
We show that naïve approaches that combine techniques from safe RL and offline RL can only learn sub-optimal solutions.
arXiv Detail & Related papers (2021-07-19T16:30:14Z) - OptiDICE: Offline Policy Optimization via Stationary Distribution
Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)