Constrained Upper Confidence Reinforcement Learning
- URL: http://arxiv.org/abs/2001.09377v1
- Date: Sun, 26 Jan 2020 00:23:02 GMT
- Title: Constrained Upper Confidence Reinforcement Learning
- Authors: Liyuan Zheng, Lillian J. Ratliff
- Abstract summary: This paper extends upper confidence reinforcement learning for settings in which the reward function and the constraints, described by cost functions, are unknown a priori.
We show that an algorithm, C-UCRL, achieves sub-linear regret with respect to the reward while satisfying the constraints with probability $1-\delta$, even while learning.
- Score: 12.919486518128734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Constrained Markov Decision Processes are a class of stochastic decision
problems in which the decision maker must select a policy that satisfies
auxiliary cost constraints. This paper extends upper confidence reinforcement
learning for settings in which the reward function and the constraints,
described by cost functions, are unknown a priori but the transition kernel is
known. Such a setting is well-motivated by a number of applications including
exploration of unknown, potentially unsafe, environments. We present an
algorithm, C-UCRL, and show that it achieves sub-linear regret
($O(T^{\frac{3}{4}}\sqrt{\log(T/\delta)})$) with respect to the reward while
satisfying the constraints with probability $1-\delta$, even while learning.
Illustrative examples are provided.
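Since the transition kernel is known and only the mean reward and cost are unknown, the general recipe in this setting is: maintain confidence bonuses on the empirical reward and cost, plan optimistically in the reward and conservatively in the cost, and execute the resulting policy. The snippet below is a minimal, hedged sketch of that recipe on a toy CMDP; the Hoeffding-style bonus, the occupancy-measure LP planner, the infeasibility fallback, and all problem parameters are illustrative assumptions, not the paper's C-UCRL pseudocode.

```python
"""Illustrative sketch only: upper-confidence planning in a CMDP with a KNOWN
transition kernel and UNKNOWN mean reward/cost. Not the paper's C-UCRL."""
import numpy as np
from scipy.optimize import linprog

S, A, H = 3, 2, 5              # toy tabular CMDP: states, actions, horizon (assumed)
BUDGET = 1.5                   # per-episode expected-cost budget (assumed)
rng = np.random.default_rng(0)

# Known transition kernel P[s, a, s'] and initial distribution (assumed values).
P = rng.dirichlet(np.ones(S), size=(S, A))
mu0 = np.ones(S) / S
# Hidden mean reward/cost in [0, 1]; the learner only observes noisy samples.
true_r = rng.uniform(size=(S, A))
true_c = rng.uniform(size=(S, A))

def plan_lp(r_hat, c_hat):
    """Constrained planning over occupancy measures q[h, s, a]:
    maximize <q, r_hat> subject to the flow constraints induced by the known
    kernel P and the budget constraint <q, c_hat> <= BUDGET."""
    n = H * S * A
    idx = lambda h, s, a: (h * S + s) * A + a
    A_eq = np.zeros((H * S, n)); b_eq = np.zeros(H * S)
    for s in range(S):                      # layer h = 0 must match mu0
        for a in range(A):
            A_eq[s, idx(0, s, a)] = 1.0
        b_eq[s] = mu0[s]
    for h in range(1, H):                   # flow conservation between layers
        for s2 in range(S):
            row = h * S + s2
            for a in range(A):
                A_eq[row, idx(h, s2, a)] = 1.0
            for s in range(S):
                for a in range(A):
                    A_eq[row, idx(h - 1, s, a)] -= P[s, a, s2]
    A_ub = np.tile(c_hat.ravel(), H).reshape(1, -1)
    res = linprog(c=-np.tile(r_hat.ravel(), H), A_ub=A_ub, b_ub=[BUDGET],
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    if res.status != 0:
        # If the conservative LP is infeasible, fall back to minimizing the
        # inflated cost (the paper's feasibility assumptions are not modeled here).
        res = linprog(c=np.tile(c_hat.ravel(), H), A_eq=A_eq, b_eq=b_eq,
                      bounds=(0, None))
    q = res.x.reshape(H, S, A) + 1e-12
    return q / q.sum(axis=2, keepdims=True)     # non-stationary policy pi[h, s, a]

counts = np.ones((S, A))                        # visit counts (start at 1)
r_sum = np.zeros((S, A)); c_sum = np.zeros((S, A))
delta = 0.05
for k in range(200):                            # episodes
    # Hoeffding-style bonus: added to the reward (optimism) and to the cost (conservatism).
    bonus = np.sqrt(np.log(2 * S * A * (k + 1) / delta) / (2 * counts))
    pi = plan_lp(np.clip(r_sum / counts + bonus, 0, 1),
                 np.clip(c_sum / counts + bonus, 0, 1))
    s = rng.choice(S, p=mu0)
    for h in range(H):                          # roll out one episode
        a = rng.choice(A, p=pi[h, s])
        r_sum[s, a] += true_r[s, a] + 0.1 * rng.standard_normal()
        c_sum[s, a] += true_c[s, a] + 0.1 * rng.standard_normal()
        counts[s, a] += 1
        s = rng.choice(S, p=P[s, a])
```

Because the kernel is known, the planning step in this sketch is an exact linear program over occupancy measures, so all statistical uncertainty enters only through the reward and cost estimates.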
Related papers
- Learning Constrained Markov Decision Processes With Non-stationary Rewards and Constraints [34.7178680288326]
In constrained Markov decision processes (CMDPs) with adversarial rewards and constraints, a well-known impossibility result prevents any algorithm from attaining sublinear regret and sublinear constraint violation.
We show that this negative result can be eased in CMDPs with non-stationary rewards and constraints, by providing algorithms whose performance degrades smoothly as non-stationarity increases.
arXiv Detail & Related papers (2024-05-23T09:48:48Z) - A Unifying Framework for Online Optimization with Long-Term Constraints [62.35194099438855]
We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints.
The goal is to maximize their total reward, while at the same time achieving small cumulative violation across the $T$ rounds.
We present the first best-of-both-worlds type algorithm for this general class of problems, with no-regret guarantees both in the case in which rewards and constraints are selected according to an unknown model, and in the case in which they are selected at each round by an adversary.
arXiv Detail & Related papers (2022-09-15T16:59:19Z) - Recursive Constraints to Prevent Instability in Constrained Reinforcement Learning [16.019477271828745]
We consider the challenge of finding a deterministic policy for a Markov decision process.
This class of problems is known to be hard, but the combined requirements of determinism and uniform optimality can create learning instability.
We present a suitable constrained reinforcement learning algorithm that prevents learning instability.
arXiv Detail & Related papers (2022-01-20T02:33:24Z) - Safe Online Bid Optimization with Return-On-Investment and Budget Constraints subject to Uncertainty [87.81197574939355]
We study the nature of both the optimization and learning problems.
We provide an algorithm, namely GCB, guaranteeing sublinear regret at the cost of a potentially linear number of constraint violations.
More interestingly, we provide an algorithm, namely GCB_safe($\psi$,$\phi$), guaranteeing both sublinear pseudo-regret and safety w.h.p. at the cost of accepting tolerances $\psi$ and $\phi$.
arXiv Detail & Related papers (2022-01-18T17:24:20Z) - First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach [57.570201404222935]
We show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^\star K})$ in reinforcement learning with large state spaces.
We demonstrate that existing techniques based on least-squares estimation are insufficient to obtain this result.
arXiv Detail & Related papers (2021-12-07T00:29:57Z) - Safe Exploration for Constrained Reinforcement Learning with Provable Guarantees [2.379828460137829]
We propose a model-based safe RL algorithm that we call the Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm.
We show that it achieves an $\tilde{\mathcal{O}}(S^2\sqrt{A H^7 K}/(\bar{C} - \bar{C}_b))$ cumulative regret without violating the safety constraints during learning.
arXiv Detail & Related papers (2021-12-01T23:21:48Z) - Safe Adaptive Learning-based Control for Constrained Linear Quadratic Regulators with Regret Guarantees [11.627320138064684]
We study the adaptive control of an unknown linear system with a quadratic cost function subject to safety constraints on both the states and actions.
Our algorithm is implemented on a single trajectory and does not require system restarts.
arXiv Detail & Related papers (2021-10-31T05:52:42Z) - Confidence-Budget Matching for Sequential Budgeted Learning [69.77435313099366]
We formalize decision-making problems with a querying budget.
We consider multi-armed bandits, linear bandits, and reinforcement learning problems.
We show that CBM-based algorithms perform well in the presence of adversity.
arXiv Detail & Related papers (2021-02-05T19:56:31Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting (a generic primal-dual sketch of the underlying Lagrangian mechanics appears after this list).
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
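The last two related papers above rely on primal-dual machinery for CMDPs. As a point of reference, the snippet below sketches only the generic Lagrangian mechanics: alternate between a primal step (best response to the reward shaped by the current Lagrange multiplier) and a dual step (projected sub-gradient ascent on the constraint violation). The tabular toy problem, known reward and cost, exact planner, and fixed step size are assumptions for illustration; neither paper's actual algorithm is reproduced here.

```python
"""Generic primal-dual sketch for a constrained MDP; illustrative assumptions only."""
import numpy as np

S, A, H, BUDGET = 3, 2, 5, 1.5                 # assumed toy problem sizes
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))     # transition kernel P[s, a, s'] (assumed)
mu0 = np.ones(S) / S                           # initial state distribution (assumed)
r = rng.uniform(size=(S, A))                   # mean reward (assumed known here)
c = rng.uniform(size=(S, A))                   # mean cost (assumed known here)

def best_response(payoff):
    """Backward induction for the unconstrained MDP with the given per-step
    payoff; returns the greedy policy's expected total reward and cost."""
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = payoff + P @ V                     # Q[s, a] = payoff[s, a] + E[V(s')]
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    Jr = Jc = 0.0
    d = mu0.copy()                             # state distribution at step h
    for h in range(H):
        a = pi[h]
        Jr += (d * r[np.arange(S), a]).sum()
        Jc += (d * c[np.arange(S), a]).sum()
        d = d @ P[np.arange(S), a]             # propagate through the policy's kernel
    return Jr, Jc

lam, eta = 0.0, 0.5
for t in range(100):
    # Primal step: best response to the Lagrangian payoff r - lam * c.
    Jr, Jc = best_response(r - lam * c)
    # Dual step: projected sub-gradient ascent on the constraint violation.
    lam = max(0.0, lam + eta * (Jc - BUDGET))
print(f"reward {Jr:.2f}, cost {Jc:.2f} (budget {BUDGET}), lambda {lam:.2f}")
```

In practice, deterministic best responses can make the primal-dual pair oscillate, so averaged or mixed policies are typically used when reporting guarantees; the sketch omits this for brevity.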