Safe Offline Reinforcement Learning with Real-Time Budget Constraints
- URL: http://arxiv.org/abs/2306.00603v2
- Date: Mon, 4 Mar 2024 14:20:17 GMT
- Title: Safe Offline Reinforcement Learning with Real-Time Budget Constraints
- Authors: Qian Lin, Bo Tang, Zifan Wu, Chao Yu, Shangqin Mao, Qianlong Xie,
Xingxing Wang, Dong Wang
- Abstract summary: In many real-world applications, the learned policy is required to respond to dynamically determined safety budgets in real time.
We propose Trajectory-based REal-time Budget Inference (TREBI) as a novel solution that models this problem from the perspective of trajectory distribution.
Empirical results on a wide range of simulation tasks and a real-world large-scale advertising application demonstrate the capability of TREBI in solving real-time budget constraint problems under offline settings.
- Score: 17.64685813460148
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aiming to promote the safe real-world deployment of Reinforcement Learning
(RL), research on safe RL has made significant progress in recent years.
However, most existing works in the literature still focus on the online
setting where risky violations of the safety budget are likely to be incurred
during training. Moreover, in many real-world applications, the learned policy
is required to respond to dynamically determined safety budgets (i.e.,
constraint thresholds) in real time. In this paper, we target this
real-time budget constraint problem under the offline setting and propose
Trajectory-based REal-time Budget Inference (TREBI) as a novel solution that
models this problem from the perspective of trajectory distribution and solves
it through diffusion model planning. Theoretically, we prove an error bound on
the estimation of episodic reward and cost under the offline setting, thus
providing a performance guarantee for TREBI. Empirical results on a wide
range of simulation tasks and a real-world large-scale advertising application
demonstrate the capability of TREBI in solving real-time budget constraint
problems under offline settings.
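As a rough illustration of the planning loop the abstract describes, the following hypothetical Python sketch samples candidate trajectories from a (stubbed) diffusion planner, scores them with learned reward and cost estimates, and keeps the best plan that fits the budget supplied at decision time. All names, shapes, and the fallback rule are illustrative assumptions, not TREBI's actual interface:

```python
# Hedged sketch of budget-conditioned trajectory planning: sample candidate
# plans, estimate each plan's episodic cost, and execute only plans whose
# cost fits the real-time safety budget.
import numpy as np

rng = np.random.default_rng(0)
HORIZON, N_CANDIDATES, ACT_DIM = 16, 64, 2

def sample_trajectories(state: np.ndarray) -> np.ndarray:
    """Stand-in for a trained diffusion planner: returns candidate action
    sequences of shape (N_CANDIDATES, HORIZON, ACT_DIM)."""
    return rng.normal(size=(N_CANDIDATES, HORIZON, ACT_DIM))

def estimate_return_and_cost(traj: np.ndarray) -> tuple[float, float]:
    """Stand-in for learned reward and cost models over a trajectory."""
    return float(traj.sum()), float(np.abs(traj).sum())

def plan(state: np.ndarray, budget: float) -> np.ndarray:
    """Pick the highest-return candidate whose estimated cost is within the
    budget; if nothing fits, degrade gracefully to the cheapest plan."""
    candidates = sample_trajectories(state)
    scored = [estimate_return_and_cost(t) for t in candidates]
    feasible = [i for i, (_, c) in enumerate(scored) if c <= budget]
    if feasible:
        best = max(feasible, key=lambda i: scored[i][0])
    else:
        best = min(range(len(scored)), key=lambda i: scored[i][1])
    return candidates[best, 0]  # execute the first action, then replan

action = plan(np.zeros(4), budget=10.0)  # budget may change at every decision
```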
Related papers
- Sparsity-based Safety Conservatism for Constrained Offline Reinforcement Learning [4.0847743592744905]
Reinforcement Learning (RL) has achieved notable success in decision-making fields like autonomous driving and robotic manipulation.
However, RL's training approach, centered on "on-policy" sampling, does not fully capitalize on available data.
Offline RL has emerged as a compelling alternative, particularly when conducting additional experiments is impractical.
arXiv Detail & Related papers (2024-07-17T20:57:05Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the reward structure and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have value in safety-sensitive settings such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Sustainable Online Reinforcement Learning for Auto-bidding [10.72140135793476]
State-of-the-art auto-bidding policies usually leverage reinforcement learning (RL) algorithms to generate real-time bids on behalf of advertisers.
Due to safety concerns, it was long believed that RL training could only be carried out in an offline virtual advertising system (VAS) built from historical data generated in the real advertising system (RAS).
In this paper, we argue that significant gaps exist between the VAS and the RAS, causing the RL training process to suffer from inconsistency between the online and offline settings.
arXiv Detail & Related papers (2022-10-13T13:17:20Z)
- Enhancing Safe Exploration Using Safety State Augmentation [71.00929878212382]
We tackle the problem of safe exploration in model-free reinforcement learning.
We derive policies for scheduling the safety budget during training.
We show that the resulting approach, Simmer, can stabilize training and improve the performance of safe RL with average constraints; a minimal scheduling sketch follows.
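As a rough illustration of budget scheduling (not the schedule derived in the paper), one simple choice is to anneal the training-time budget linearly from a tight initial value toward the deployment budget; the ramp and its parameters below are illustrative assumptions:

```python
# Hedged sketch: linearly anneal the training-time safety budget from a
# conservative starting value toward the true deployment budget.
def scheduled_budget(step: int, total_steps: int,
                     start: float = 5.0, target: float = 25.0) -> float:
    """Fraction-of-training interpolation between start and target budgets."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (target - start)

budgets = [scheduled_budget(s, 100_000) for s in range(0, 100_001, 25_000)]
# [5.0, 10.0, 15.0, 20.0, 25.0]
```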
arXiv Detail & Related papers (2022-06-06T15:23:07Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We study the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction; the underlying constrained objective is sketched below.
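For reference, a generic form of the constrained objective that such stationary-distribution methods optimize can be written as follows (the notation is ours, for illustration; see the paper for the exact formulation):

```latex
% Constrained objective over the stationary state-action distribution d,
% with cost threshold \hat{c} and the Bellman flow constraint on d:
\begin{aligned}
\max_{d \ge 0} \quad & \mathbb{E}_{(s,a) \sim d}\big[\, r(s,a) \,\big] \\
\text{s.t.} \quad & \mathbb{E}_{(s,a) \sim d}\big[\, c(s,a) \,\big] \le \hat{c}, \\
& \textstyle\sum_{a} d(s,a) = (1-\gamma)\, \mu_0(s)
  + \gamma \sum_{s',a'} P(s \mid s', a')\, d(s', a') \quad \forall s,
\end{aligned}
```

where mu_0 is the initial-state distribution and the last line constrains d to be a valid discounted occupancy measure.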
arXiv Detail & Related papers (2022-04-19T15:55:47Z)
- Constraints Penalized Q-Learning for Safe Offline Reinforcement Learning [15.841609263723575]
We study the problem of safe offline reinforcement learning (RL).
The goal is to learn a policy that maximizes long-term reward while satisfying safety constraints given only offline data, without further interaction with the environment.
We show that naïve approaches that combine techniques from safe RL and offline RL can only learn sub-optimal solutions.
arXiv Detail & Related papers (2021-07-19T16:30:14Z)
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly; a minimal sketch follows.
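A rough, hypothetical sketch of this kind of uncertainty-based down-weighting, using Monte-Carlo dropout variance as the uncertainty signal (the exp(-beta * variance) weighting here is our illustrative choice, not necessarily the paper's exact form; network sizes and hyperparameters are assumptions):

```python
# Hedged sketch: estimate epistemic uncertainty of target Q-values with
# Monte-Carlo dropout and down-weight high-uncertainty (likely OOD)
# transitions in the critic loss.
import torch
import torch.nn as nn

class DropoutQ(nn.Module):
    def __init__(self, in_dim: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Dropout(p=0.1),
            nn.Linear(64, 1),
        )

    def forward(self, sa: torch.Tensor) -> torch.Tensor:
        return self.net(sa)

def uncertainty_weights(q_target: DropoutQ, sa_next: torch.Tensor,
                        n_samples: int = 10, beta: float = 1.0) -> torch.Tensor:
    """Variance across stochastic dropout passes approximates epistemic
    uncertainty; weight each transition by exp(-beta * variance)."""
    q_target.train()  # keep dropout active when computing targets
    with torch.no_grad():
        qs = torch.stack([q_target(sa_next) for _ in range(n_samples)])
    return torch.exp(-beta * qs.var(dim=0))

q_target = DropoutQ()
sa_next = torch.randn(32, 6)                # batch of next state-action pairs
w = uncertainty_weights(q_target, sa_next)  # shape (32, 1), values in (0, 1]
# critic_loss = (w * (q_pred - td_target).pow(2)).mean()  # weighted TD loss
```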
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
- MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost-ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample efficient Batch Optimization (MUSBO).
Our framework discovers novel, high-quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z)
- Critic Regularized Regression [70.8487887738354]
We propose a novel offline RL algorithm that learns policies from data using a form of critic-regularized regression (CRR).
We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces; a minimal sketch of its advantage-weighted variant follows.
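As a rough illustration, the following hypothetical PyTorch sketch implements the exponential advantage-weighted variant of this idea: behavior cloning where each dataset action is weighted by exp((Q(s,a) - V(s)) / temp). Network sizes, the clipping threshold, and hyperparameters are assumptions for illustration:

```python
# Hedged sketch of critic-regularized regression: weighted behavior
# cloning, so the policy imitates only actions the critic scores above
# the policy's own average value.
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    def __init__(self, s_dim: int = 4, a_dim: int = 2):
        super().__init__()
        self.mu = nn.Linear(s_dim, a_dim)
        self.log_std = nn.Parameter(torch.zeros(a_dim))

    def forward(self, s: torch.Tensor) -> Normal:
        return Normal(self.mu(s), self.log_std.exp())

class Critic(nn.Module):
    def __init__(self, s_dim: int = 4, a_dim: int = 2):
        super().__init__()
        self.q = nn.Linear(s_dim + a_dim, 1)

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.q(torch.cat([s, a], dim=-1))

def crr_policy_loss(policy, critic, states, actions, m: int = 4,
                    temp: float = 1.0) -> torch.Tensor:
    """Weighted log-likelihood: w = exp(A / temp), A = Q(s, a) - V(s),
    with V(s) estimated by averaging Q over m policy samples."""
    dist = policy(states)
    with torch.no_grad():
        q_sa = critic(states, actions)
        v = torch.stack([critic(states, dist.sample())
                         for _ in range(m)]).mean(0)
        w = torch.exp((q_sa - v) / temp).clamp(max=20.0)  # clip for stability
    logp = dist.log_prob(actions).sum(-1, keepdim=True)
    return -(w * logp).mean()

states, actions = torch.randn(32, 4), torch.randn(32, 2)
loss = crr_policy_loss(GaussianPolicy(), Critic(), states, actions)
loss.backward()
```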
arXiv Detail & Related papers (2020-06-26T17:50:26Z)