Safe Offline Reinforcement Learning with Real-Time Budget Constraints
- URL: http://arxiv.org/abs/2306.00603v2
- Date: Mon, 4 Mar 2024 14:20:17 GMT
- Title: Safe Offline Reinforcement Learning with Real-Time Budget Constraints
- Authors: Qian Lin, Bo Tang, Zifan Wu, Chao Yu, Shangqin Mao, Qianlong Xie,
Xingxing Wang, Dong Wang
- Abstract summary: In many real-world applications, the learned policy is required to respond to dynamically determined safety budgets in real time.
We propose Trajectory-based REal-time Budget Inference (TREBI) as a novel solution that models this problem from the perspective of trajectory distribution.
Empirical results on a wide range of simulation tasks and a real-world large-scale advertising application demonstrate the capability of TREBI in solving real-time budget constraint problems under offline settings.
- Score: 17.64685813460148
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aiming to promote the safe real-world deployment of Reinforcement Learning
(RL), research on safe RL has made significant progress in recent years.
However, most existing works in the literature still focus on the online
setting where risky violations of the safety budget are likely to be incurred
during training. Moreover, in many real-world applications, the learned policy
is required to respond to dynamically determined safety budgets (i.e.,
constraint thresholds) in real time. In this paper, we target this
real-time budget constraint problem under the offline setting and propose
Trajectory-based REal-time Budget Inference (TREBI) as a novel solution that
models this problem from the perspective of trajectory distribution and solves
it through diffusion model planning. Theoretically, we prove an error bound on
the estimation of episodic reward and cost under the offline setting, thus
providing a performance guarantee for TREBI. Empirical results on a wide
range of simulation tasks and a real-world large-scale advertising application
demonstrate the capability of TREBI in solving real-time budget constraint
problems under offline settings.
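As a rough illustration of the planning loop the abstract describes, the following hypothetical Python sketch samples candidate trajectories from a (stubbed) diffusion planner, scores them with learned reward and cost estimates, and keeps the best plan that fits the budget supplied at decision time. All names, shapes, and the fallback rule are illustrative assumptions, not TREBI's actual interface:

```python
# Hedged sketch of budget-conditioned trajectory planning: sample candidate
# plans, estimate each plan's episodic cost, and execute only plans whose
# cost fits the real-time safety budget.
import numpy as np

rng = np.random.default_rng(0)
HORIZON, N_CANDIDATES, ACT_DIM = 16, 64, 2

def sample_trajectories(state: np.ndarray) -> np.ndarray:
    """Stand-in for a trained diffusion planner: returns candidate action
    sequences of shape (N_CANDIDATES, HORIZON, ACT_DIM)."""
    return rng.normal(size=(N_CANDIDATES, HORIZON, ACT_DIM))

def estimate_return_and_cost(traj: np.ndarray) -> tuple[float, float]:
    """Stand-in for learned reward and cost models over a trajectory."""
    return float(traj.sum()), float(np.abs(traj).sum())

def plan(state: np.ndarray, budget: float) -> np.ndarray:
    """Pick the highest-return candidate whose estimated cost is within the
    budget; if nothing fits, degrade gracefully to the cheapest plan."""
    candidates = sample_trajectories(state)
    scored = [estimate_return_and_cost(t) for t in candidates]
    feasible = [i for i, (_, c) in enumerate(scored) if c <= budget]
    if feasible:
        best = max(feasible, key=lambda i: scored[i][0])
    else:
        best = min(range(len(scored)), key=lambda i: scored[i][1])
    return candidates[best, 0]  # execute the first action, then replan

action = plan(np.zeros(4), budget=10.0)  # budget may change at every decision
```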
Related papers
- Sparsity-based Safety Conservatism for Constrained Offline Reinforcement Learning [4.0847743592744905]
Reinforcement Learning (RL) has achieved notable success in decision-making fields like autonomous driving and robotic manipulation.
However, RL's training approach, centered on "on-policy" sampling, does not fully capitalize on available data.
Offline RL has emerged as a compelling alternative, particularly when conducting additional experiments is impractical.
arXiv Detail & Related papers (2024-07-17T20:57:05Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the reward structure and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have value in safety-sensitive settings such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Sustainable Online Reinforcement Learning for Auto-bidding [10.72140135793476]
State-of-the-art auto-bidding policies usually leverage reinforcement learning (RL) algorithms to generate real-time bids on behalf of advertisers.
Due to safety concerns, it was long believed that RL training could only be carried out in an offline virtual advertising system (VAS) built from historical data generated in the real advertising system (RAS).
In this paper, we argue that significant gaps exist between the VAS and the RAS, causing the RL training process to suffer from inconsistency between the online and offline settings.
arXiv Detail & Related papers (2022-10-13T13:17:20Z)
- Enhancing Safe Exploration Using Safety State Augmentation [71.00929878212382]
We tackle the problem of safe exploration in model-free reinforcement learning.
We derive policies for scheduling the safety budget during training.
We show that the resulting approach, Simmer, can stabilize training and improve the performance of safe RL with average constraints; a minimal scheduling sketch follows.
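As a rough illustration of budget scheduling (not the schedule derived in the paper), one simple choice is to anneal the training-time budget linearly from a tight initial value toward the deployment budget; the ramp and its parameters below are illustrative assumptions:

```python
# Hedged sketch: linearly anneal the training-time safety budget from a
# conservative starting value toward the true deployment budget.
def scheduled_budget(step: int, total_steps: int,
                     start: float = 5.0, target: float = 25.0) -> float:
    """Fraction-of-training interpolation between start and target budgets."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (target - start)

budgets = [scheduled_budget(s, 100_000) for s in range(0, 100_001, 25_000)]
# [5.0, 10.0, 15.0, 20.0, 25.0]
```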
arXiv Detail & Related papers (2022-06-06T15:23:07Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We study the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction; the underlying constrained objective is sketched below.
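For reference, a generic form of the constrained objective that such stationary-distribution methods optimize can be written as follows (the notation is ours, for illustration; see the paper for the exact formulation):

```latex
% Constrained objective over the stationary state-action distribution d,
% with cost threshold \hat{c} and the Bellman flow constraint on d:
\begin{aligned}
\max_{d \ge 0} \quad & \mathbb{E}_{(s,a) \sim d}\big[\, r(s,a) \,\big] \\
\text{s.t.} \quad & \mathbb{E}_{(s,a) \sim d}\big[\, c(s,a) \,\big] \le \hat{c}, \\
& \textstyle\sum_{a} d(s,a) = (1-\gamma)\, \mu_0(s)
  + \gamma \sum_{s',a'} P(s \mid s', a')\, d(s', a') \quad \forall s,
\end{aligned}
```

where mu_0 is the initial-state distribution and the last line constrains d to be a valid discounted occupancy measure.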
arXiv Detail & Related papers (2022-04-19T15:55:47Z)
- Constraints Penalized Q-Learning for Safe Offline Reinforcement Learning [15.841609263723575]
We study the problem of safe offline reinforcement learning (RL).
The goal is to learn a policy that maximizes long-term reward while satisfying safety constraints given only offline data, without further interaction with the environment.
We show that naïve approaches that combine techniques from safe RL and offline RL can only learn sub-optimal solutions.
arXiv Detail & Related papers (2021-07-19T16:30:14Z)
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly; a minimal sketch follows.
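A rough, hypothetical sketch of this kind of uncertainty-based down-weighting, using Monte-Carlo dropout variance as the uncertainty signal (the exp(-beta * variance) weighting here is our illustrative choice, not necessarily the paper's exact form; network sizes and hyperparameters are assumptions):

```python
# Hedged sketch: estimate epistemic uncertainty of target Q-values with
# Monte-Carlo dropout and down-weight high-uncertainty (likely OOD)
# transitions in the critic loss.
import torch
import torch.nn as nn

class DropoutQ(nn.Module):
    def __init__(self, in_dim: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Dropout(p=0.1),
            nn.Linear(64, 1),
        )

    def forward(self, sa: torch.Tensor) -> torch.Tensor:
        return self.net(sa)

def uncertainty_weights(q_target: DropoutQ, sa_next: torch.Tensor,
                        n_samples: int = 10, beta: float = 1.0) -> torch.Tensor:
    """Variance across stochastic dropout passes approximates epistemic
    uncertainty; weight each transition by exp(-beta * variance)."""
    q_target.train()  # keep dropout active when computing targets
    with torch.no_grad():
        qs = torch.stack([q_target(sa_next) for _ in range(n_samples)])
    return torch.exp(-beta * qs.var(dim=0))

q_target = DropoutQ()
sa_next = torch.randn(32, 6)                # batch of next state-action pairs
w = uncertainty_weights(q_target, sa_next)  # shape (32, 1), values in (0, 1]
# critic_loss = (w * (q_pred - td_target).pow(2)).mean()  # weighted TD loss
```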
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
- MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost-ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample efficient Batch Optimization (MUSBO).
Our framework discovers novel, high-quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z)
- Critic Regularized Regression [70.8487887738354]
We propose a novel offline RL algorithm that learns policies from data using a form of critic-regularized regression (CRR).
We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces; a minimal sketch of its advantage-weighted variant follows.
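As a rough illustration, the following hypothetical PyTorch sketch implements the exponential advantage-weighted variant of this idea: behavior cloning where each dataset action is weighted by exp((Q(s,a) - V(s)) / temp). Network sizes, the clipping threshold, and hyperparameters are assumptions for illustration:

```python
# Hedged sketch of critic-regularized regression: weighted behavior
# cloning, so the policy imitates only actions the critic scores above
# the policy's own average value.
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    def __init__(self, s_dim: int = 4, a_dim: int = 2):
        super().__init__()
        self.mu = nn.Linear(s_dim, a_dim)
        self.log_std = nn.Parameter(torch.zeros(a_dim))

    def forward(self, s: torch.Tensor) -> Normal:
        return Normal(self.mu(s), self.log_std.exp())

class Critic(nn.Module):
    def __init__(self, s_dim: int = 4, a_dim: int = 2):
        super().__init__()
        self.q = nn.Linear(s_dim + a_dim, 1)

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.q(torch.cat([s, a], dim=-1))

def crr_policy_loss(policy, critic, states, actions, m: int = 4,
                    temp: float = 1.0) -> torch.Tensor:
    """Weighted log-likelihood: w = exp(A / temp), A = Q(s, a) - V(s),
    with V(s) estimated by averaging Q over m policy samples."""
    dist = policy(states)
    with torch.no_grad():
        q_sa = critic(states, actions)
        v = torch.stack([critic(states, dist.sample())
                         for _ in range(m)]).mean(0)
        w = torch.exp((q_sa - v) / temp).clamp(max=20.0)  # clip for stability
    logp = dist.log_prob(actions).sum(-1, keepdim=True)
    return -(w * logp).mean()

states, actions = torch.randn(32, 4), torch.randn(32, 2)
loss = crr_policy_loss(GaussianPolicy(), Critic(), states, actions)
loss.backward()
```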
arXiv Detail & Related papers (2020-06-26T17:50:26Z)