Reinforcement Learning and Regret Bounds for Admission Control
- URL: http://arxiv.org/abs/2406.04766v1
- Date: Fri, 7 Jun 2024 09:09:14 GMT
- Title: Reinforcement Learning and Regret Bounds for Admission Control
- Authors: Lucas Weber, Ana Bušić, Jiamin Zhu
- Abstract summary: The expected regret of any reinforcement learning algorithm is lower bounded by $\Omega\left(\sqrt{DXAT}\right)$ for undiscounted returns.
We propose an algorithm inspired by UCRL2, and use the structure of the problem to upper bound the expected total regret by $O(S\log T + \sqrt{mT \log T})$ in the finite server case.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The expected regret of any reinforcement learning algorithm is lower bounded by $\Omega\left(\sqrt{DXAT}\right)$ for undiscounted returns, where $D$ is the diameter of the Markov decision process, $X$ the size of the state space, $A$ the size of the action space and $T$ the number of time steps. However, this lower bound is general. A smaller regret can be obtained by taking into account some specific knowledge of the problem structure. In this article, we consider an admission control problem to an $M/M/c/S$ queue with $m$ job classes and class-dependent rewards and holding costs. Queuing systems often have a diameter that is exponential in the buffer size $S$, making the previous lower bound prohibitive for any practical use. We propose an algorithm inspired by UCRL2, and use the structure of the problem to upper bound the expected total regret by $O(S\log T + \sqrt{mT \log T})$ in the finite server case. In the infinite server case, we prove that the dependence of the regret on $S$ disappears.
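To make the setting concrete, below is a minimal simulation sketch of the admission-control problem the abstract describes: an $M/M/c/S$ queue with $m$ job classes, class-dependent admission rewards, and a per-job holding cost. This is not the authors' code; the class name, parameter values, and the event-driven loop are illustrative assumptions, and the threshold policy at the end is used only as an example controller.

```python
import random


class AdmissionControlQueue:
    """Event-driven sketch of an M/M/c/S queue with m job classes.
    On each class-i arrival the controller decides whether to admit the job
    (collecting reward R[i]) or reject it; a holding cost accrues per job
    per unit time. All names and defaults here are illustrative assumptions."""

    def __init__(self, arrival_rates, service_rate, num_servers, buffer_size,
                 rewards, holding_cost):
        self.lam = arrival_rates   # lambda_i for each of the m classes
        self.mu = service_rate     # per-server service rate
        self.c = num_servers       # number of servers (c)
        self.S = buffer_size       # max number of jobs in the system (S)
        self.R = rewards           # class-dependent admission rewards
        self.h = holding_cost      # holding cost per job per unit time
        self.x = 0                 # current number of jobs in the system

    def step(self, policy):
        """Advance to the next event. `policy(x, i)` returns True to admit a
        class-i arrival when x jobs are present. Returns (reward, dt)."""
        # Exponential race between the m arrival streams and the busy servers.
        rates = list(self.lam) + [self.mu * min(self.x, self.c)]
        total_rate = sum(rates)
        dt = random.expovariate(total_rate)
        reward = -self.h * self.x * dt          # holding cost over the interval
        event = random.choices(range(len(rates)), weights=rates)[0]
        if event < len(self.lam):               # arrival of class `event`
            if self.x < self.S and policy(self.x, event):
                self.x += 1
                reward += self.R[event]
        elif self.x > 0:                        # service completion
            self.x -= 1
        return reward, dt


if __name__ == "__main__":
    # Two job classes; admit class i only while fewer than threshold[i] jobs
    # are present (a simple threshold policy, shown for illustration only).
    env = AdmissionControlQueue(arrival_rates=[1.0, 0.5], service_rate=1.0,
                                num_servers=2, buffer_size=10,
                                rewards=[4.0, 1.0], holding_cost=0.5)
    threshold = [10, 4]
    policy = lambda x, i: x < threshold[i]
    total_reward, total_time = 0.0, 0.0
    while total_time < 10_000.0:
        r, dt = env.step(policy)
        total_reward += r
        total_time += dt
    print("long-run average reward:", total_reward / total_time)
```

A learning algorithm in the spirit of the paper (UCRL2-style optimism) would replace the fixed thresholds with a policy computed from optimistic estimates of the unknown arrival and service rates; the sketch above only fixes the environment dynamics against which regret would be measured.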
Related papers
- Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z)
- On learning Whittle index policy for restless bandits with scalable regret [1.0152838128195465]
Reinforcement learning is an attractive approach to learn good resource allocation and scheduling policies.
The cumulative regret of most RL algorithms scales as $\tilde{O}(\mathsf{S}\sqrt{\mathsf{A}T})$, where $\mathsf{S}$ is the size of the state space.
We present a model-based RL algorithm for such problems which has scalable regret.
arXiv Detail & Related papers (2022-02-07T19:07:02Z)
- Minimal Expected Regret in Linear Quadratic Control [79.81807680370677]
We devise an online learning algorithm and provide guarantees on its expected regret.
This regret at time $T$ is upper bounded by $\widetilde{O}((d_u+d_x)\sqrt{d_x T})$ when $A$ and $B$ are unknown.
arXiv Detail & Related papers (2021-09-29T14:07:21Z)
- Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem, in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S| |A| K})$, where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z)
- Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
- $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic $Q$-learning algorithm enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}(H)}{\Delta_{\min}}\log(SAT)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z)
- Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z)
- Logistic Regression Regret: What's the Catch? [3.7311680121118345]
We derive lower bounds with logarithmic regret under $L_\infty$ constraints on the parameters.
For $L_2$ constraints, it is shown that for large enough $d$, the regret remains linear in $d$ but is no longer logarithmic in $T$.
arXiv Detail & Related papers (2020-02-07T18:36:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.