Provably Efficient Exploration in Constrained Reinforcement Learning: Posterior Sampling Is All You Need
- URL: http://arxiv.org/abs/2309.15737v1
- Date: Wed, 27 Sep 2023 15:48:36 GMT
- Title: Provably Efficient Exploration in Constrained Reinforcement Learning: Posterior Sampling Is All You Need
- Authors: Danil Provodin, Pratik Gajane, Mykola Pechenizkiy and Maurits Kaptein
- Abstract summary: We present a new algorithm based on posterior sampling for learning in constrained Markov decision processes (CMDP) in the infinite-horizon undiscounted setting.
The algorithm achieves near-optimal regret bounds while being advantageous empirically compared to the existing algorithms.
- Score: 15.113053885573171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a new algorithm based on posterior sampling for learning in
constrained Markov decision processes (CMDP) in the infinite-horizon
undiscounted setting. The algorithm achieves near-optimal regret bounds while
being advantageous empirically compared to the existing algorithms. Our main
theoretical result is a Bayesian regret bound for each cost component of
$\tilde{O}(HS\sqrt{AT})$ for any communicating CMDP with $S$ states, $A$ actions,
and bound on the hitting time H. This regret bound matches the lower bound in
order of time horizon T and is the best-known regret bound for communicating
CMDPs in the infinite-horizon undiscounted setting. Empirical results show
that, despite its simplicity, our posterior sampling algorithm outperforms the
existing algorithms for constrained reinforcement learning.
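To make the posterior-sampling idea concrete, here is a minimal, hypothetical sketch of the generic Thompson-sampling loop for a tabular MDP: maintain a Dirichlet posterior over unknown transition dynamics, sample a model from it, act greedily under the sampled model, and update the posterior with observed transitions. This is an illustration of the general principle only, not the paper's algorithm: it omits the cost constraints, the infinite-horizon undiscounted objective, and the regret analysis, and all names and constants below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H_EP, EPISODES = 3, 2, 10, 50  # toy sizes, chosen arbitrarily

# True (unknown) dynamics and a known reward table for the simulation.
true_P = rng.dirichlet(np.ones(S), size=(S, A))  # true_P[s, a] -> next-state dist
R = rng.uniform(size=(S, A))                     # known rewards, for simplicity

# Dirichlet(1) prior over transitions, stored as pseudo-counts[s, a, s'].
counts = np.ones((S, A, S))

def greedy_policy(P):
    """Finite-horizon value iteration on a sampled model; returns a greedy policy."""
    V = np.zeros(S)
    for _ in range(H_EP):
        Q = R + P @ V            # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] * V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

for _ in range(EPISODES):
    # 1) Sample one plausible model from the current posterior.
    P_hat = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                      for s in range(S)])
    # 2) Act greedily with respect to the sampled model for one episode.
    policy = greedy_policy(P_hat)
    s = 0
    for _ in range(H_EP):
        a = policy[s]
        s_next = rng.choice(S, p=true_P[s, a])
        counts[s, a, s_next] += 1  # 3) Bayesian update with the observed transition.
        s = s_next

print(int(counts.sum()) - S * A * S)  # total transitions observed: EPISODES * H_EP
```

The key design point the abstract alludes to is step 1: exploration comes for free from posterior sampling, because uncertain state-action pairs yield optimistic sampled models often enough to be tried, without explicit exploration bonuses. Extending this loop to a CMDP would additionally require sampling cost dynamics and solving a constrained planning problem in step 2.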
Related papers
- Deep Inertia $L_p$ Half-Quadratic Splitting Unrolling Network for Sparse View CT Reconstruction [20.632166806596278]
Sparse view computed tomography (CT) reconstruction poses a challenging ill-posed inverse problem, necessitating effective regularization techniques.
We employ $L_p$-norm regularization to induce sparsity and introduce inertial steps, leading to the development of the inertial $L_p$-norm half-quadratic splitting algorithm.
Our proposed algorithm surpasses existing methods, particularly excelling in fewer scanned views and complex noise conditions.
arXiv Detail & Related papers (2024-08-13T03:32:59Z) - Efficient Exploration in Average-Reward Constrained Reinforcement Learning: Achieving Near-Optimal Regret With Posterior Sampling [14.776559457850624]
We present a new algorithm based on posterior sampling for learning in Constrained Markov Decision Processes (CMDP)
The algorithm achieves near-optimal regret bounds while being advantageous empirically compared to the existing algorithms.
arXiv Detail & Related papers (2024-05-29T11:59:56Z) - Indexed Minimum Empirical Divergence-Based Algorithms for Linear Bandits [55.938644481736446]
Indexed Minimum Empirical Divergence (IMED) is a highly effective approach to the multi-armed bandit problem.
It has been observed to empirically outperform UCB-based algorithms and Thompson Sampling.
We present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms.
arXiv Detail & Related papers (2024-05-24T04:11:58Z) - A Single-Loop Deep Actor-Critic Algorithm for Constrained Reinforcement Learning with Provable Convergence [7.586600116278698]
The algorithm combines the actor-critic framework with deep neural networks (DNNs) for constrained reinforcement learning.
arXiv Detail & Related papers (2023-06-10T10:04:54Z) - Uniform-PAC Bounds for Reinforcement Learning with Linear Function
Approximation [92.3161051419884]
We study reinforcement learning with linear function approximation.
Existing algorithms only have high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees.
We propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability.
arXiv Detail & Related papers (2021-06-22T08:48:56Z) - Experimental Design for Regret Minimization in Linear Bandits [19.8309784360219]
We propose a novel experimental design-based algorithm to minimize regret in online linear and combinatorial bandits.
We provide state-of-the-art finite time regret guarantees and show that our algorithm can be applied in both the bandit and semi-bandit feedback regime.
arXiv Detail & Related papers (2020-11-01T17:59:19Z) - An Asymptotically Optimal Primal-Dual Incremental Algorithm for
Contextual Linear Bandits [129.1029690825929]
We introduce a novel algorithm improving over the state-of-the-art along multiple dimensions.
We establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits.
arXiv Detail & Related papers (2020-10-23T09:12:47Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Accelerated Message Passing for Entropy-Regularized MAP Inference [89.15658822319928]
Maximum a posteriori (MAP) inference in discrete-valued random fields is a fundamental problem in machine learning.
Due to the difficulty of this problem, linear programming (LP) relaxations are commonly used to derive specialized message passing algorithms.
We present randomized methods for accelerating these algorithms by leveraging techniques that underlie classical accelerated gradient.
arXiv Detail & Related papers (2020-07-01T18:43:32Z) - Lagrangian Decomposition for Neural Network Verification [148.0448557991349]
A fundamental component of neural network verification is the computation of bounds on the values their outputs can take.
We propose a novel approach based on Lagrangian Decomposition.
We show that we obtain bounds comparable with off-the-shelf solvers in a fraction of their running time.
arXiv Detail & Related papers (2020-02-24T17:55:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.