Square-root regret bounds for continuous-time episodic Markov decision
processes
- URL: http://arxiv.org/abs/2210.00832v2
- Date: Tue, 3 Oct 2023 01:50:59 GMT
- Title: Square-root regret bounds for continuous-time episodic Markov decision
processes
- Authors: Xuefeng Gao and Xun Yu Zhou
- Abstract summary: We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting.
We present a learning algorithm based on the methods of value iteration and upper confidence bound.
- Score: 11.585113506994471
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study reinforcement learning for continuous-time Markov decision processes
(MDPs) in the finite-horizon episodic setting. In contrast to discrete-time
MDPs, the inter-transition times of a continuous-time MDP are exponentially
distributed with rate parameters depending on the state--action pair at each
transition. We present a learning algorithm based on the methods of value
iteration and upper confidence bound. We derive an upper bound on the
worst-case expected regret for the proposed algorithm, and establish a
worst-case lower bound; both bounds are of order square root in the
number of episodes. Finally, we conduct simulation experiments to illustrate
the performance of our algorithm.
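The setting described in the abstract can be made concrete with a short simulation. The sketch below (Python) is an illustration only, not the authors' algorithm: it simulates episodes of a finite-horizon continuous-time MDP whose inter-transition times are exponential with state--action-dependent rates, and plans with a crude UCB-style optimistic backward induction built from visit counts. The state/action sizes, horizon, bonus constant, and number of planning steps are arbitrary assumptions.
```python
import numpy as np

# Illustrative sketch only: all sizes, rates, the horizon, and the bonus
# constant are assumptions, not values from the paper.
rng = np.random.default_rng(0)
S, A = 4, 2          # number of states and actions (assumed)
T = 1.0              # episode length / horizon (assumed)

lam = rng.uniform(0.5, 2.0, size=(S, A))       # true sojourn rates lambda(s, a)
R = rng.uniform(0.0, 1.0, size=(S, A))         # reward collected at each transition
P = rng.dirichlet(np.ones(S), size=(S, A))     # transition kernel p(.|s, a)

def run_episode(policy):
    """One episode: holding times are Exponential(lam[s, a]) until the horizon T."""
    s, t, data = 0, 0.0, []
    while True:
        a = policy(s)
        dt = rng.exponential(1.0 / lam[s, a])  # exponential inter-transition time
        if t + dt > T:                         # episode ends at the horizon
            return data
        s_next = rng.choice(S, p=P[s, a])
        data.append((s, a, dt, s_next))
        s, t = s_next, t + dt

def optimistic_policy(counts, P_hat, R_hat, H=5, c=1.0):
    """Crude UCB-style planner: inflate rewards by c / sqrt(visit count) and
    run H steps of backward induction on the supplied model."""
    bonus = c / np.sqrt(np.maximum(counts, 1.0))
    V = np.zeros(S)
    for _ in range(H):                         # finite-horizon backward induction
        Q = R_hat + bonus + P_hat @ V          # (S, A) optimistic action values
        V = Q.max(axis=1)
    return lambda s: int(Q[s].argmax())

# Collect visit counts with a uniformly random policy, then plan optimistically
# (the true model is passed here purely as a stand-in for empirical estimates).
counts = np.zeros((S, A))
for _ in range(20):
    for (s, a, dt, s_next) in run_episode(lambda s: int(rng.integers(A))):
        counts[s, a] += 1
pi = optimistic_policy(counts, P, R)
print("greedy optimistic actions per state:", [pi(s) for s in range(S)])
```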
Related papers
- Near-Optimal Learning and Planning in Separated Latent MDPs [70.88315649628251]
We study computational and statistical aspects of learning Latent Markov Decision Processes (LMDPs).
In this model, the learner interacts with an MDP drawn at the beginning of each epoch from an unknown mixture of MDPs.
arXiv Detail & Related papers (2024-06-12T06:41:47Z)
- Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z)
- Improved High-Probability Bounds for the Temporal Difference Learning Algorithm via Exponential Stability [17.771354881467435]
We show that a simple algorithm with a universal and instance-independent step size is sufficient to obtain near-optimal variance and bias terms.
Our proof technique is based on refined error bounds for linear approximation together with the novel stability result for the product of random matrices.
arXiv Detail & Related papers (2023-10-22T12:37:25Z)
- Provably Efficient Exploration in Constrained Reinforcement Learning: Posterior Sampling Is All You Need [15.113053885573171]
We present a new algorithm based on posterior sampling for learning in constrained Markov decision processes (CMDP) in the infinite-horizon undiscounted setting.
The algorithm achieves near-optimal regret bounds while being advantageous empirically compared to the existing algorithms.
arXiv Detail & Related papers (2023-09-27T15:48:36Z)
- A Reduction-based Framework for Sequential Decision Making with Delayed Feedback [53.79893086002961]
We study delayed feedback in general multi-agent sequential decision making.
We propose a novel reduction-based framework, which turns any multi-batched algorithm for sequential decision making with instantaneous feedback into a sample-efficient algorithm.
arXiv Detail & Related papers (2023-02-03T01:16:09Z)
- Reinforcement Learning with a Terminator [80.34572413850186]
We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds.
We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret.
arXiv Detail & Related papers (2022-05-30T18:40:28Z)
- Logarithmic regret bounds for continuous-time average-reward Markov decision processes [9.806527798032809]
We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting.
We derive instance-dependent regret lower bounds that are logarithmic in the time horizon.
arXiv Detail & Related papers (2022-05-23T10:15:00Z)
- A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces [53.47210316424326]
KeRNS is an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes.
We prove a regret bound that scales with the covering dimension of the state-action space and the total variation of the MDP with time.
arXiv Detail & Related papers (2020-07-09T21:37:13Z)
- Logarithmic regret for episodic continuous-time linear-quadratic reinforcement learning over a finite-time horizon [7.123160883637873]
We study continuous-time linear-quadratic reinforcement learning problems in an episodic setting.
We propose a least-squares algorithm based on continuous-time observations and controls.
arXiv Detail & Related papers (2020-06-27T08:14:59Z)
- Planning in Markov Decision Processes with Gap-Dependent Sample Complexity [48.98199700043158]
We propose MDP-GapE, a new trajectory-based Monte-Carlo Tree Search algorithm for planning in a Markov Decision Process.
We prove an upper bound on the number of calls to the generative models needed for MDP-GapE to identify a near-optimal action with high probability.
arXiv Detail & Related papers (2020-06-10T15:05:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.