Safe Posterior Sampling for Constrained MDPs with Bounded Constraint
Violation
- URL: http://arxiv.org/abs/2301.11547v1
- Date: Fri, 27 Jan 2023 06:18:25 GMT
- Title: Safe Posterior Sampling for Constrained MDPs with Bounded Constraint
Violation
- Authors: Krishna C Kalagarla, Rahul Jain, Pierluigi Nuzzo
- Abstract summary: Constrained Markov decision processes (CMDPs) model scenarios of sequential decision making with multiple objectives that are increasingly important in many applications.
We propose the Safe PSRL (posterior sampling-based RL) algorithm that does not need such assumptions and yet performs very well.
We establish a sub-linear $\tilde{\mathcal{O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2 |\mathcal{A}| K}\right)$ upper bound on the Bayesian reward objective regret along with a bounded, i.e., $\tilde{\mathcal{O}}(1)$, constraint violation regret.
- Score: 8.849815837266977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Constrained Markov decision processes (CMDPs) model scenarios of sequential
decision making with multiple objectives that are increasingly important in
many applications. However, the model is often unknown and must be learned
online while still ensuring the constraint is met, or at least the violation is
bounded with time. Some recent papers have made progress on this very
challenging problem but either need unsatisfactory assumptions such as
knowledge of a safe policy, or have high cumulative regret. We propose the Safe
PSRL (posterior sampling-based RL) algorithm that does not need such
assumptions and yet performs very well, both in terms of theoretical regret
bounds and empirically. The algorithm achieves an efficient tradeoff
between exploration and exploitation by use of the posterior sampling
principle, and provably suffers only bounded constraint violation by leveraging
the idea of pessimism. Our approach builds on a primal-dual formulation. We
establish a sub-linear $\tilde{\mathcal{O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2
|\mathcal{A}| K} \right)$ upper bound on the Bayesian reward objective regret
along with a bounded, i.e., $\tilde{\mathcal{O}}\left(1\right)$ constraint
violation regret over $K$ episodes for an $|\mathcal{S}|$-state,
$|\mathcal{A}|$-action and horizon $H$ CMDP.
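The abstract describes the ingredients of Safe PSRL only at a high level: posterior sampling over the unknown CMDP for exploration, pessimistically inflated cost estimates for bounded constraint violation, and a primal-dual treatment of the constraint. The sketch below is a minimal illustration of that recipe for a tabular episodic CMDP; every name (`SafePSRLSketch`, `cost_budget`, `pessimism`, `eta`) is an assumption, and the paper's actual bonuses, planning step, and dual update may differ.
```python
import numpy as np


class SafePSRLSketch:
    """Illustrative sketch of a posterior-sampling CMDP learner with cost pessimism
    and a primal-dual (Lagrangian) treatment of the constraint."""

    def __init__(self, n_states, n_actions, horizon, cost_budget,
                 eta=0.1, pessimism=0.1, seed=0):
        self.S, self.A, self.H = n_states, n_actions, horizon
        self.budget = cost_budget      # per-episode constraint threshold
        self.eta = eta                 # dual step size for the Lagrange multiplier
        self.pessimism = pessimism     # scale of the pessimistic cost inflation
        self.lam = 0.0                 # Lagrange multiplier for the cost constraint
        self.rng = np.random.default_rng(seed)
        # Dirichlet posterior over transitions, via counts of (s, a, s') visits.
        self.trans_counts = np.ones((self.S, self.A, self.S))
        self.r_sum = np.zeros((self.S, self.A))
        self.c_sum = np.zeros((self.S, self.A))
        self.n = np.zeros((self.S, self.A))

    def _sample_model(self):
        """Posterior sampling: draw one transition kernel from the Dirichlet posterior."""
        P = np.stack([[self.rng.dirichlet(self.trans_counts[s, a])
                       for a in range(self.A)] for s in range(self.S)])
        n_safe = np.maximum(self.n, 1.0)
        r_hat = self.r_sum / n_safe
        # Pessimism: inflate the estimated costs so the planned policy is conservative
        # with respect to the constraint (the source of the bounded-violation behavior).
        c_hat = self.c_sum / n_safe + self.pessimism / np.sqrt(n_safe)
        return P, r_hat, np.clip(c_hat, 0.0, 1.0)

    def plan(self):
        """Backward induction on the Lagrangian reward r - lam * c under the sampled model."""
        P, r, c = self._sample_model()
        lagrangian = r - self.lam * c
        V = np.zeros(self.S)
        policy = np.zeros((self.H, self.S), dtype=int)
        for h in reversed(range(self.H)):
            Q = lagrangian + P @ V          # shape (S, A): one-step lookahead
            policy[h] = Q.argmax(axis=1)
            V = Q.max(axis=1)
        return policy

    def update(self, trajectory, episode_cost):
        """Update the posterior counts and take a projected dual-ascent step on lambda."""
        for s, a, rew, cost, s_next in trajectory:
            self.trans_counts[s, a, s_next] += 1
            self.r_sum[s, a] += rew
            self.c_sum[s, a] += cost
            self.n[s, a] += 1
        self.lam = max(0.0, self.lam + self.eta * (episode_cost - self.budget))
```
Per the abstract, posterior sampling handles the exploration-exploitation tradeoff, while the pessimistic cost estimates are what keep the cumulative constraint violation bounded at $\tilde{\mathcal{O}}(1)$ rather than growing with $K$.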
Related papers
- Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
We study the generalized low-rank matrix bandit problem, proposed in \cite{lu2021low} under the Generalized Linear Model (GLM) framework.
To overcome the computational infeasibility and theoretical limitations of existing algorithms, we first propose the G-ESTT framework.
We show that G-ESTT can achieve the $\tilde{O}(\sqrt{(d_1+d_2)^{3/2} M r^{3/2} T})$ bound of regret, while G-ESTS can achieve the $\tilde{O}(\ldots)$
arXiv Detail & Related papers (2024-01-14T14:14:19Z) - Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs [17.62509045102346]
This paper considers the best policy identification problem in online Constrained Markov Decision Processes (CMDPs)
We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with a high probability.
Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy.
arXiv Detail & Related papers (2023-09-27T04:33:09Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms (a minimal sketch of such a primal-dual loop appears after this list).
arXiv Detail & Related papers (2022-04-11T15:08:09Z) - Achieving Zero Constraint Violation for Constrained Reinforcement
Learning via Primal-Dual Approach [37.80609997145897]
Reinforcement learning is widely used in applications where one needs to perform sequential decisions while interacting with the environment.
The problem becomes more challenging when the decision requirement includes satisfying some safety constraints.
Various algorithms are available to solve CMDP problems in a model-free manner to achieve $\epsilon$-optimal cumulative reward with $\epsilon$-feasible policies.
An important question here is whether we can achieve $\epsilon$-optimal cumulative reward with zero constraint violation.
arXiv Detail & Related papers (2021-09-13T21:27:03Z) - Learning Policies with Zero or Bounded Constraint Violation for
Constrained MDPs [17.825031573375725]
We pose the problem in an episodic framework of a constrained Markov decision process.
It is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation.
We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability.
arXiv Detail & Related papers (2021-06-04T19:46:55Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB)
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Kernel-Based Reinforcement Learning: A Finite-Time Analysis [53.47210316424326]
We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards.
We empirically validate our approach in continuous MDPs with sparse rewards.
arXiv Detail & Related papers (2020-04-12T12:23:46Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
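Several of the entries above ("Towards Painless Policy Optimization for Constrained MDPs", "Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach", "Provably Efficient Safe Exploration via Primal-Dual Policy Optimization") revolve around the same generic primal-dual recipe as the main paper. The sketch below is a hypothetical, minimal version of that loop; `value_c`, `best_response`, and the step-size choices are placeholders for illustration, not APIs from any of the cited papers.
```python
import numpy as np


def primal_dual_cmdp(n_iters, value_c, best_response, budget, eta=0.05, lam_max=10.0):
    """Generic primal-dual loop for a CMDP: maximize V_r(pi) s.t. V_c(pi) <= budget.

    best_response(lam): returns a policy (approximately) maximizing the Lagrangian
        value V_r(pi) - lam * V_c(pi); this is where individual algorithms differ.
    value_c(pi): evaluates the expected cumulative cost of a policy.
    """
    lam = 0.0
    policies = []
    for _ in range(n_iters):
        pi = best_response(lam)            # primal step: optimize the scalarized objective
        violation = value_c(pi) - budget   # signed constraint violation of the current policy
        lam = float(np.clip(lam + eta * violation, 0.0, lam_max))  # projected dual ascent
        policies.append(pi)
    # CMDP guarantees are usually stated for the (uniform) mixture of these policies.
    return policies, lam


if __name__ == "__main__":
    # Toy illustration: a "policy" is a scalar p in [0, 1] mixing a rewarding but
    # costly action (reward 2, cost 1) with a safe, zero-reward action; budget 0.5.
    def value_c(p):
        return p

    def best_response(lam):
        return 1.0 if 2.0 - lam > 0 else 0.0  # maximize (2 - lam) * p over p in [0, 1]

    policies, lam = primal_dual_cmdp(200, value_c, best_response, budget=0.5)
    print(f"final lambda = {lam:.2f}, average policy = {sum(policies) / len(policies):.2f}")
```
The papers listed above largely differ in how `best_response` is realized (optimism, pessimism, posterior sampling, or function approximation) and in how the dual step and any cost pessimism are tuned to obtain sublinear, bounded, or zero constraint violation.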