Reinforcement Learning in a Birth and Death Process: Breaking the
Dependence on the State Space
- URL: http://arxiv.org/abs/2302.10667v1
- Date: Tue, 21 Feb 2023 13:28:37 GMT
- Title: Reinforcement Learning in a Birth and Death Process: Breaking the
Dependence on the State Space
- Authors: Jonatha Anselmi (POLARIS, LIG), Bruno Gaujal (POLARIS, LIG),
Louis-Sébastien Rebuffi (POLARIS, LIG, UGA)
- Abstract summary: We revisit the regret of undiscounted reinforcement learning in MDPs with a birth and death structure.
In our main result, we show that the regret of a slightly-tweaked version of the classical learning algorithm {\sc Ucrl2} is in fact upper bounded by $\tilde{\mathcal{O}}(\sqrt{E_2AT})$, where $E_2$ is related to the weighted second moment of the stationary measure of a reference policy.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we revisit the regret of undiscounted reinforcement learning
in MDPs with a birth and death structure. Specifically, we consider a
controlled queue with impatient jobs and the main objective is to optimize a
trade-off between energy consumption and user-perceived performance. Within
this setting, the \emph{diameter} $D$ of the MDP is $\Omega(S^S)$, where $S$ is
the number of states. Therefore, the existing lower and upper bounds on the
regret at time $T$, of order $O(\sqrt{DSAT})$ for MDPs with $S$ states and $A$
actions, may suggest that reinforcement learning is inefficient here. In our
main result, however, we exploit the structure of our MDPs to show that the
regret of a slightly-tweaked version of the classical learning algorithm {\sc
Ucrl2} is in fact upper bounded by $\tilde{\mathcal{O}}(\sqrt{E_2AT})$ where
$E_2$ is related to the weighted second moment of the stationary measure of a
reference policy. Importantly, $E_2$ is bounded independently of $S$. Thus, our
bound is asymptotically independent of the number of states and of the
diameter. This result is based on a careful study of the number of visits
performed by the learning algorithm to the states of the MDP, which is highly
non-uniform.
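To make the setting concrete, here is a minimal Python sketch of a controlled birth-and-death queue with impatient jobs of the kind described above: the state is the queue length, transitions move by at most one job per step, and the action trades energy consumption against user-perceived performance. The rates, cost coefficients, uniformization constant and threshold policy below are illustrative assumptions, not the paper's exact model.

```python
"""Hypothetical sketch of a birth-and-death MDP: a single queue with
impatient jobs, where the action chooses how many servers to activate,
trading energy cost against user-perceived performance."""
import numpy as np

S = 50          # number of states (queue lengths 0 .. S-1)
A = 4           # actions: number of active servers 1 .. A
lam = 1.0       # arrival rate (illustrative)
mu = 0.5        # per-server service rate (illustrative)
theta = 0.2     # per-job impatience (abandonment) rate (illustrative)
c_energy = 1.0  # energy cost per active server
c_hold = 0.1    # holding cost per waiting job


def transition_and_cost(x: int, a: int, rng: np.random.Generator):
    """Sample the next state and one-step cost from state x under action a.

    Birth-and-death structure: the queue length moves by at most one per
    (uniformized) step -- up on an arrival, down on a service completion
    or an abandonment by an impatient job.
    """
    up = lam if x < S - 1 else 0.0                # birth rate
    down = a * mu + x * theta if x > 0 else 0.0   # death rate (service + impatience)
    unif = lam + A * mu + (S - 1) * theta         # uniformization constant

    u = rng.random() * unif
    if u < up:
        x_next = x + 1
    elif u < up + down:
        x_next = x - 1
    else:
        x_next = x                                # dummy (self-loop) transition

    cost = c_energy * a + c_hold * x              # energy vs. performance trade-off
    return x_next, cost


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, total = 10, 0.0
    for _ in range(1000):
        a = 1 if x < 5 else A                     # a simple threshold policy
        x, c = transition_and_cost(x, a, rng)
        total += c
    print(f"average cost of the threshold policy: {total / 1000:.3f}")
```

Under a stable policy, such a chain spends most of its time in a small set of low-occupancy states, so a learning algorithm's visit counts are highly non-uniform across states; this is the structural property the paper exploits to replace the dependence on the diameter and the state space by $E_2$, a weighted second moment of the stationary measure of a reference policy.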
Related papers
- Reinforcement Learning from Adversarial Preferences in Tabular MDPs [62.73758165845971]
We introduce a new framework of episodic Markov decision processes (MDPs) with adversarial preferences (PbMDPs). Unlike standard episodic MDPs with adversarial losses, in PbMDPs the learner instead observes preferences between two candidate arms. We develop algorithms that achieve a regret bound of order $T^{2/3}$ under known transitions.
arXiv Detail & Related papers (2025-07-15T20:19:32Z) - Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation [3.2703356989962518]
We study model-based reinforcement learning with non-linear function approximation.
We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings.
arXiv Detail & Related papers (2024-06-19T15:29:14Z) - Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z) - Span-Based Optimal Sample Complexity for Average Reward MDPs [6.996002801232415]
We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model.
We establish the complexity bound $\widetilde{O}\left(SA\frac{H}{(1-\gamma)^2\varepsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.
arXiv Detail & Related papers (2023-11-22T15:34:44Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs [99.59319332864129]
We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps.
In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$.
arXiv Detail & Related papers (2020-10-01T17:57:47Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs).
We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z) - Learning Near Optimal Policies with Low Inherent Bellman Error [115.16037976819331]
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning.
We show that exploration is possible using only \emph{batch} assumptions with an algorithm that achieves the optimal statistical rate for the setting we consider.
arXiv Detail & Related papers (2020-02-29T02:02:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information above (including all generated summaries) and is not responsible for any consequences arising from its use.