On Bellman's principle of optimality and Reinforcement learning for
safety-constrained Markov decision process
- URL: http://arxiv.org/abs/2302.13152v3
- Date: Wed, 12 Jul 2023 11:11:05 GMT
- Title: On Bellman's principle of optimality and Reinforcement learning for
safety-constrained Markov decision process
- Authors: Rahul Misra, Rafał Wisniewski and Carsten Skovmose Kallesøe
- Abstract summary: We study optimality for the safety-constrained Markov decision process which is the underlying framework for safe reinforcement learning.
We construct a modified $Q$-learning algorithm for learning the Lagrangian from data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We study optimality for the safety-constrained Markov decision process which
is the underlying framework for safe reinforcement learning. Specifically, we
consider a constrained Markov decision process (with finite states and finite
actions) where the goal of the decision maker is to reach a target set while
avoiding unsafe sets with certain probabilistic guarantees. Therefore the
underlying Markov chain for any control policy will be multichain since by
definition there exists a target set and an unsafe set. The decision maker also
has to be optimal (with respect to a cost function) while navigating to the
target set. This gives rise to a multi-objective optimization problem. We
highlight that Bellman's principle of optimality may not hold for
constrained Markov decision problems with an underlying multichain structure
(as shown by the counterexample due to Haviv). We resolve the counterexample by
formulating the aforementioned multi-objective optimization problem as a
zero-sum game and thereafter construct an asynchronous value iteration scheme
for the Lagrangian (similar to Shapley's algorithm). Finally, we consider the
reinforcement learning problem for the same and construct a modified
$Q$-learning algorithm for learning the Lagrangian from data. We also provide a
lower bound on the number of iterations required for learning the Lagrangian
and corresponding error bounds.
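To make the Lagrangian-based learning step concrete, the sketch below shows a generic tabular $Q$-learning loop on a Lagrangian of the form cost $+\ \lambda\cdot$(safety cost), with a dual-ascent update on $\lambda$ after each episode. The environment interface, the per-step safety signal, the budget, and the dual step size are illustrative assumptions; the paper's modified $Q$-learning algorithm, derived from the zero-sum game formulation, differs in its details.

```python
import numpy as np

def lagrangian_q_learning(env, n_states, n_actions, episodes=5000,
                          gamma=0.95, alpha=0.1, dual_lr=0.01,
                          budget=0.1, epsilon=0.1, seed=0):
    """Generic sketch: tabular Q-learning on the Lagrangian
    cost + lam * safety_cost, with dual ascent on lam.
    Assumes env.reset() -> state and
    env.step(a) -> (next_state, cost, safety_cost, done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))  # Q-values of the Lagrangian (minimized)
    lam = 0.0                            # Lagrange multiplier for the safety constraint

    for _ in range(episodes):
        s = env.reset()
        done = False
        episode_safety = 0.0
        while not done:
            # epsilon-greedy over the (minimized) Lagrangian Q-values
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmin(Q[s]))
            s_next, cost, safety_cost, done = env.step(a)
            episode_safety += safety_cost

            # one-step Lagrangian target: stage cost plus lam-weighted safety cost
            target = cost + lam * safety_cost + (0.0 if done else gamma * np.min(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next

        # dual ascent: increase lam when the episode's safety cost exceeds the budget
        lam = max(0.0, lam + dual_lr * (episode_safety - budget))

    return Q, lam
```

The minimization over actions reflects a cost-to-be-minimized convention; with rewards instead of costs, the argmin/min would become argmax/max.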
Related papers
- Robust Q-Learning for finite ambiguity sets [2.3020018305241337]
We propose a novel $Q$-learning algorithm for solving distributionally robust Markov decision problems.
Our approach goes beyond the well-studied cases involving ambiguity sets of balls around some reference measure.
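As a rough illustration of the finite-ambiguity-set idea, the sketch below runs robust $Q$-value iteration with known candidate transition kernels, taking the worst case over the set in each backup. The paper's algorithm is a sample-based $Q$-learning variant, so this planning-style code is only an assumed analogue of the robust Bellman operator.

```python
import numpy as np

def robust_q_iteration(kernels, reward, gamma=0.9, iters=500, tol=1e-8):
    """Robust Q-value iteration over a finite ambiguity set.
    kernels: array of shape (K, S, A, S), K candidate transition models.
    reward:  array of shape (S, A).
    Each backup uses the worst-case (minimum over k) continuation value."""
    K, S, A, _ = kernels.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)                          # greedy value, shape (S,)
        cont = np.einsum('ksat,t->ksa', kernels, V)  # expected value per kernel
        Q_new = reward + gamma * cont.min(axis=0)    # worst case over the set
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q
```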
arXiv Detail & Related papers (2024-07-05T05:19:36Z)
- Regularized Q-Learning with Linear Function Approximation [2.765106384328772]
We consider a bi-level optimization formulation of regularized Q-learning with linear function approximation.
We show that, under certain assumptions, the proposed algorithm converges to a stationary point in the presence of Markovian noise.
arXiv Detail & Related papers (2024-01-26T20:45:40Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
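A value distribution can be represented, for instance, by a set of quantiles fitted with the pinball loss; the toy functions below show that generic quantile-regression update. They are not EQR itself, which is model-based and Bayesian, just a hedged sketch of the underlying regression step.

```python
import numpy as np

def pinball_loss(quantiles, sample, taus):
    """Quantile-regression (pinball) loss for one target sample.
    quantiles: shape (n,), current quantile estimates of the value.
    sample:    scalar value/return sample (e.g. from a model rollout).
    taus:      shape (n,), quantile levels in (0, 1)."""
    diff = sample - quantiles
    return np.mean(np.where(diff >= 0, taus * diff, (taus - 1.0) * diff))

def quantile_sgd_step(quantiles, sample, taus, lr=0.05):
    """One subgradient step moving each quantile estimate toward its level."""
    diff = sample - quantiles
    grad = np.where(diff >= 0, -taus, 1.0 - taus)  # d(loss)/d(quantile)
    return quantiles - lr * grad
```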
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- A Unifying Framework for Online Optimization with Long-Term Constraints [62.35194099438855]
We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints.
The goal is to maximize their total reward, while at the same time achieving small cumulative violation across the $T$ rounds.
We present the first best-of-both-worlds type algorithm for this general class of problems, with no-regret guarantees both in the case in which rewards and constraints are selected according to an unknown model, and in the case in which they are selected at each round by an adversary.
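A standard way to trade reward against long-term constraints is a primal-dual scheme in which Lagrange multipliers track accumulated violation; the toy loop below illustrates that idea for a finite action set. It is a generic sketch under assumed `reward_fns`/`cost_fns` interfaces, not the best-of-both-worlds algorithm of the paper.

```python
import numpy as np

def primal_dual_online(reward_fns, cost_fns, n_actions, T, dual_lr=0.1):
    """Toy primal-dual scheme for online decisions with long-term constraints.
    reward_fns[t](a) -> scalar reward; cost_fns[t](a) -> vector of m costs,
    where a cost <= 0 means that constraint is satisfied at round t."""
    m = len(cost_fns[0](0))
    lam = np.zeros(m)
    total_reward, cum_violation = 0.0, np.zeros(m)
    for t in range(T):
        # primal step: Lagrangian best response over the finite action set
        scores = [reward_fns[t](a) - lam @ np.asarray(cost_fns[t](a))
                  for a in range(n_actions)]
        a_t = int(np.argmax(scores))
        c_t = np.asarray(cost_fns[t](a_t))
        total_reward += reward_fns[t](a_t)
        cum_violation += np.maximum(c_t, 0.0)
        # dual step: projected gradient ascent on the multipliers
        lam = np.maximum(0.0, lam + dual_lr * c_t)
    return total_reward, cum_violation
```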
arXiv Detail & Related papers (2022-09-15T16:59:19Z)
- Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing constraint violations in policy tasks in safe reinforcement learning.
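The sketch below shows one illustrative log-barrier gradient step for a problem min f(x) s.t. g_i(x) < 0, with a crude backtracking safeguard that keeps iterates strictly feasible. LBSGD's carefully chosen adaptive step size is not reproduced here, so treat this as an assumption-based approximation of the idea rather than the paper's method.

```python
import numpy as np

def log_barrier_step(x, grad_f, constraints, grad_constraints,
                     eta=1e-2, t_barrier=10.0):
    """One gradient step on the barrier objective
    f(x) - (1/t) * sum_i log(-g_i(x)), assuming a strictly feasible x.
    constraints: list of callables g_i(x) -> scalar (< 0 when satisfied).
    grad_constraints: list of callables returning the gradient of each g_i."""
    g_vals = np.array([g(x) for g in constraints])
    assert np.all(g_vals < 0), "iterate must be strictly feasible"

    # gradient of the barrier objective at x
    barrier_grad = grad_f(x) + sum(
        (1.0 / (t_barrier * (-g(x)))) * dg(x)
        for g, dg in zip(constraints, grad_constraints)
    )

    # crude backtracking: shrink the step until the iterate stays feasible
    step = eta
    x_new = x - step * barrier_grad
    while any(g(x_new) >= 0 for g in constraints):
        step *= 0.5
        x_new = x - step * barrier_grad
    return x_new
```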
arXiv Detail & Related papers (2022-07-21T11:14:47Z)
- Algorithm for Constrained Markov Decision Process with Linear Convergence [55.41644538483948]
An agent aims to maximize the expected accumulated discounted reward subject to multiple constraints on its costs.
A new dual approach is proposed with the integration of two ingredients: entropy regularized policy and Vaidya's dual.
The proposed approach is shown to converge at a linear rate to the global optimum.
arXiv Detail & Related papers (2022-06-03T16:26:38Z)
- Recursive Constraints to Prevent Instability in Constrained Reinforcement Learning [16.019477271828745]
We consider the challenge of finding a deterministic policy for a Markov decision process.
This class of problems is known to be hard, and the combined requirements of determinism and uniform optimality can create learning instability.
We present a suitable constrained reinforcement learning algorithm that prevents learning instability.
arXiv Detail & Related papers (2022-01-20T02:33:24Z)
- Safe Online Bid Optimization with Return-On-Investment and Budget Constraints subject to Uncertainty [87.81197574939355]
We study the nature of both the optimization and learning problems.
We provide an algorithm, namely GCB, guaranteeing sublinear regret at the cost of a potentially linear number of constraints violations.
More interestingly, we provide an algorithm, namely GCB$_{\text{safe}}(\psi,\phi)$, guaranteeing both sublinear pseudo-regret and safety w.h.p. at the cost of accepting tolerances $\psi$ and $\phi$.
arXiv Detail & Related papers (2022-01-18T17:24:20Z)
- Concave Utility Reinforcement Learning with Zero-Constraint Violations [43.29210413964558]
We consider the problem of concave utility reinforcement learning (CURL) with convex constraints.
We propose a model-based learning algorithm that also achieves zero constraint violations.
arXiv Detail & Related papers (2021-09-12T06:13:33Z)
- Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z)
- CertRL: Formalizing Convergence Proofs for Value and Policy Iteration in Coq [1.154957229836278]
Reinforcement learning algorithms solve sequential decision-making problems in probabilistic environments by optimizing for long-term reward.
This paper develops a formalization of two canonical reinforcement learning algorithms: value and policy iteration for finite state Markov decision processes.
The CertRL library provides a general framework for proving properties about Markov decision processes and reinforcement learning algorithms.
arXiv Detail & Related papers (2020-09-23T22:28:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.