Logarithmic regret bounds for continuous-time average-reward Markov decision processes
- URL: http://arxiv.org/abs/2205.11168v4
- Date: Tue, 2 Jul 2024 06:57:21 GMT
- Title: Logarithmic regret bounds for continuous-time average-reward Markov decision processes
- Authors: Xuefeng Gao, Xun Yu Zhou
- Abstract summary: We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting.
We derive instance-dependent regret lower bounds that are logarithmic in the time horizon.
- Score: 9.806527798032809
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
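The abstract points to three ingredients: upper confidence reinforcement learning, estimation of the mean holding times, and stochastic comparison of point processes. As a rough illustration of the first two, the sketch below maintains the sufficient statistics such a learner would track on a small tabular continuous-time MDP: transition counts and accumulated exponential holding times, with crude Hoeffding-style confidence radii. All parameters (state and action counts, rates, the uniformly random behaviour policy, the confidence level) are hypothetical assumptions for illustration; this is not the authors' algorithm, which additionally plans optimistically over the resulting confidence sets.

```python
# Minimal sketch (not the paper's algorithm): empirical estimation of the
# quantities an upper-confidence learner for a continuous-time average-reward
# MDP would track -- transition frequencies and mean exponential holding times.
# All parameters below (states, actions, rates, policy) are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 3, 2
# Hypothetical ground truth, unknown to the learner.
true_P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition probabilities
true_rate = rng.uniform(0.5, 2.0, size=(n_states, n_actions))          # exponential holding rates

# Sufficient statistics maintained by the learner.
visit_count = np.zeros((n_states, n_actions))
trans_count = np.zeros((n_states, n_actions, n_states))
holding_sum = np.zeros((n_states, n_actions))

def step(s, a):
    """Simulate one sojourn: exponential holding time, then a jump."""
    tau = rng.exponential(1.0 / true_rate[s, a])
    s_next = rng.choice(n_states, p=true_P[s, a])
    return tau, s_next

# Roll out a (here: uniformly random) behaviour policy up to time horizon T.
T, t, s = 500.0, 0.0, 0
while t < T:
    a = rng.integers(n_actions)
    tau, s_next = step(s, a)
    visit_count[s, a] += 1
    trans_count[s, a, s_next] += 1
    holding_sum[s, a] += tau
    t, s = t + tau, s_next

# Empirical estimates with crude Hoeffding-style confidence radii; an
# optimistic (UCRL-style) planner would pick the most favourable model
# inside these confidence sets before recomputing its policy.
n = np.maximum(visit_count, 1)
P_hat = trans_count / n[..., None]
mean_hold_hat = holding_sum / n
delta = 0.05
conf_radius = np.sqrt(np.log(2.0 / delta) / (2.0 * n))

print("estimated mean holding times:\n", np.round(mean_hold_hat, 3))
print("confidence radii:\n", np.round(conf_radius, 3))
```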
Related papers
- On Bellman equations for continuous-time policy evaluation I: discretization and approximation [3.704688279256839]
We study the problem of computing the value function from a discretely-observed trajectory of a continuous-time diffusion process.
We develop a new class of algorithms based on easily implementable numerical schemes that are compatible with discrete-time reinforcement learning.
arXiv Detail & Related papers (2024-07-08T14:05:03Z)
- An Idiosyncrasy of Time-discretization in Reinforcement Learning [7.085780872622857]
We study how the choice of discretization may affect a reinforcement learning algorithm.
We acknowledge an idiosyncrasy with naively applying a discrete-time algorithm to a discretized continuous-time environment.
arXiv Detail & Related papers (2024-06-21T08:03:25Z)
- Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning.
This paper analyzes the run time and staleness of distributed SGD based on stochastic delay differential equations (SDDEs) and an approximation of gradient arrivals.
Interestingly, it is shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness.
arXiv Detail & Related papers (2024-06-17T02:56:55Z)
- Near-Optimal Learning and Planning in Separated Latent MDPs [70.88315649628251]
We study computational and statistical aspects of learning Latent Markov Decision Processes (LMDPs).
In this model, the learner interacts with an MDP drawn at the beginning of each epoch from an unknown mixture of MDPs.
arXiv Detail & Related papers (2024-06-12T06:41:47Z)
- Bridging discrete and continuous state spaces: Exploring the Ehrenfest process in time-continuous diffusion models [4.186575888568896]
We study time-continuous Markov jump processes on discrete state spaces.
We show that the time-reversal of the Ehrenfest process converges to the time-reversed Ornstein-Uhlenbeck process.
arXiv Detail & Related papers (2024-05-06T15:12:51Z)
- Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z)
- Square-root regret bounds for continuous-time episodic Markov decision processes [11.585113506994471]
We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting.
We present a learning algorithm based on the methods of value iteration and upper confidence bounds.
arXiv Detail & Related papers (2022-10-03T11:35:07Z)
- Reinforcement Learning with a Terminator [80.34572413850186]
We learn the parameters of the Termination Markov Decision Process (TerMDP) and leverage the structure of the estimation problem to provide state-wise confidence bounds.
We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret.
arXiv Detail & Related papers (2022-05-30T18:40:28Z)
- Contrastive learning of strong-mixing continuous-time stochastic processes [53.82893653745542]
Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data.
We show that a properly constructed contrastive learning task can be used to estimate the transition kernel for small-to-mid-range intervals in the diffusion case.
arXiv Detail & Related papers (2021-03-03T23:06:47Z)
- A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces [53.47210316424326]
KeRNS is an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes.
We prove a regret bound that scales with the covering dimension of the state-action space and the total variation of the MDP with time.
arXiv Detail & Related papers (2020-07-09T21:37:13Z)
- Logarithmic regret for episodic continuous-time linear-quadratic reinforcement learning over a finite-time horizon [7.123160883637873]
We study continuous-time linear-quadratic reinforcement learning problems in an episodic setting.
We propose a least-squares algorithm based on continuous-time observations and controls.
arXiv Detail & Related papers (2020-06-27T08:14:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.