On learning Whittle index policy for restless bandits with scalable regret
- URL: http://arxiv.org/abs/2202.03463v2
- Date: Wed, 26 Apr 2023 23:46:29 GMT
- Title: On learning Whittle index policy for restless bandits with scalable regret
- Authors: Nima Akbarzadeh, Aditya Mahajan
- Abstract summary: Reinforcement learning is an attractive approach to learn good resource allocation and scheduling policies.
The cumulative regret of most RL algorithms scales as $\tilde{O}(\mathsf{S} \sqrt{\mathsf{A} T})$, where $\mathsf{S}$ is the size of the state space.
We present a model-based RL algorithm for such problems which has scalable regret.
- Score: 1.0152838128195465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning is an attractive approach to learn good resource
allocation and scheduling policies based on data when the system model is
unknown. However, the cumulative regret of most RL algorithms scales as $\tilde
O(\mathsf{S} \sqrt{\mathsf{A} T})$, where $\mathsf{S}$ is the size of the state
space, $\mathsf{A}$ is the size of the action space, $T$ is the horizon, and
the $\tilde{O}(\cdot)$ notation hides logarithmic terms. Due to the linear
dependence on the size of the state space, these regret bounds are
prohibitively large for resource allocation and scheduling problems. In this
paper, we present a model-based RL algorithm for such problems which has
scalable regret. In particular, we consider a restless bandit model, and
propose a Thompson-sampling based learning algorithm which is tuned to the
underlying structure of the model. We present two characterizations of the
regret of the proposed algorithm with respect to the Whittle index policy.
First, we show that for a restless bandit with $n$ arms and at most $m$
activations at each time, the regret scales either as $\tilde{O}(mn\sqrt{T})$
or $\tilde{O}(n^2 \sqrt{T})$ depending on the reward model. Second, under an
additional technical assumption, we show that the regret scales as
$\tilde{O}(n^{1.5} \sqrt{T})$ or $\tilde{O}(\max\{m\sqrt{n}, n\} \sqrt{T})$. We
present numerical examples to illustrate the salient features of the algorithm.
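
The abstract outlines the approach at a high level: maintain a posterior over each arm's unknown transition model, periodically sample a model from the posterior (Thompson sampling), and act with the Whittle index policy computed for the sampled model. The following is a minimal sketch of that idea, not the paper's exact algorithm: the two-action arm model, the reward structure, the discounted value iteration, the subsidy-grid approximation of the Whittle index, and the fixed episode length are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, n_states, m_active, horizon = 5, 3, 2, 5_000
episode_len = 100  # fixed here for simplicity; the paper uses a more careful episode schedule

# Hypothetical ground truth: per-arm transition kernels P[arm, action, s, s']
# (action 0 = passive, 1 = active) and per-state rewards earned when active.
P_true = rng.dirichlet(np.ones(n_states), size=(n_arms, 2, n_states))
reward = rng.uniform(size=(n_arms, n_states))


def whittle_indices(P, r, gamma=0.95, grid=np.linspace(0.0, 1.0, 41)):
    """Approximate Whittle indices on a subsidy grid: the index of a state is
    (roughly) the smallest passive subsidy at which staying passive is optimal
    in the single-arm discounted MDP with that subsidy."""
    n_s = r.shape[0]
    idx = np.full(n_s, grid[-1])
    found = np.zeros(n_s, dtype=bool)
    for lam in grid:
        V = np.zeros(n_s)
        for _ in range(100):                       # value iteration
            q_passive = lam + gamma * P[0] @ V
            q_active = r + gamma * P[1] @ V
            V = np.maximum(q_passive, q_active)
        passive_opt = q_passive >= q_active
        idx = np.where(passive_opt & ~found, lam, idx)
        found |= passive_opt
    return idx


# Dirichlet posterior counts over each (arm, action, state) transition row.
counts = np.ones((n_arms, 2, n_states, n_states))
state = rng.integers(n_states, size=n_arms)
t, total_reward = 0, 0.0

while t < horizon:
    # Thompson sampling: draw one transition model per arm from the posterior
    # and compute the Whittle indices of the sampled model once per episode.
    P_hat = np.empty_like(P_true)
    for i in range(n_arms):
        for a in range(2):
            for s in range(n_states):
                P_hat[i, a, s] = rng.dirichlet(counts[i, a, s])
    indices = np.array([whittle_indices(P_hat[i], reward[i]) for i in range(n_arms)])

    for _ in range(episode_len):
        if t >= horizon:
            break
        # Whittle index policy: activate the m arms whose current states
        # have the largest sampled indices.
        current = indices[np.arange(n_arms), state]
        active = np.argsort(current)[-m_active:]
        action = np.zeros(n_arms, dtype=int)
        action[active] = 1
        total_reward += reward[active, state[active]].sum()
        # Every arm evolves (restless); update the posterior with each observed transition.
        for i in range(n_arms):
            nxt = rng.choice(n_states, p=P_true[i, action[i], state[i]])
            counts[i, action[i], state[i], nxt] += 1
            state[i] = nxt
        t += 1

print(f"average reward per step: {total_reward / t:.3f}")
```

In this sketch the per-arm Dirichlet posteriors concentrate as transitions are observed, so the sampled indices, and hence the activation decisions, approach those of the true model; the paper's regret bounds quantify how quickly this happens relative to the Whittle index policy of the true model.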
Related papers
- Reinforcement Learning and Regret Bounds for Admission Control [0.08192907805418585]
The expected regret of any reinforcement learning algorithm is lower bounded by $\Omega\left(\sqrt{DXAT}\right)$ for undiscounted returns.
We propose an algorithm inspired by UCRL2, and use the structure of the problem to upper bound the expected total regret by $O(S\log T + \sqrt{mT\log T})$ in the finite server case.
arXiv Detail & Related papers (2024-06-07T09:09:14Z)
- Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits.
Our algorithms deliver near-optimal regret bounds in both the stochastic and adversarial regimes.
arXiv Detail & Related papers (2023-12-24T08:27:30Z)
- Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes [21.77276136591518]
We develop provably efficient model-free reinforcement learning (RL) algorithms for Markov Decision Processes (MDPs).
In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ samples.
arXiv Detail & Related papers (2023-06-28T17:43:19Z)
- Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log (|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z)
- Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm that achieves a near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z)
- Contextual Recommendations and Low-Regret Cutting-Plane Algorithms [49.91214213074933]
We consider the following variant of contextual linear bandits motivated by routing applications in navigational engines and recommendation systems.
We design novel cutting-plane algorithms with low "regret" -- the total distance between the true point $w^*$ and the hyperplanes the separation oracle returns.
arXiv Detail & Related papers (2021-06-09T05:39:05Z)
- Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints [94.76881135901753]
We consider two popular limited adaptivity models: batch learning model and rare policy switch model.
Our proposed LSVI-UCB-Batch algorithm achieves an $\tilde{O}(\sqrt{d^3H^3T} + dHT/B)$ regret.
For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\tilde{O}(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret.
arXiv Detail & Related papers (2021-01-06T18:56:07Z)
- Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z)
- $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic $Q$-learning algorithm enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap (a schematic sketch of this optimistic update follows the list).
arXiv Detail & Related papers (2020-06-16T13:01:33Z)
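
For the "$Q$-learning with Logarithmic Regret" entry above, the following is a minimal sketch of tabular optimistic $Q$-learning with a Hoeffding-style exploration bonus; the toy episodic MDP, the bonus constant, and the learning-rate schedule are illustrative assumptions rather than that paper's exact construction.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
S, A, H, K = 4, 2, 5, 2000          # states, actions, horizon, episodes

# Hypothetical episodic MDP: stage-dependent transitions and rewards in [0, 1].
P = rng.dirichlet(np.ones(S), size=(H, S, A))
R = rng.uniform(size=(H, S, A))

Q = np.full((H, S, A), float(H))    # optimistic initialization at H
V = np.zeros((H + 1, S))            # V[H] stays 0 (terminal)
N = np.zeros((H, S, A))             # visit counts
c, total = 0.5, 0.0                 # bonus constant (assumed) and return tally

for k in range(K):
    s = int(rng.integers(S))
    for h in range(H):
        a = int(np.argmax(Q[h, s]))                      # act greedily w.r.t. optimistic Q
        r = R[h, s, a]
        s_next = int(rng.choice(S, p=P[h, s, a]))
        N[h, s, a] += 1
        t_visits = N[h, s, a]
        alpha = (H + 1) / (H + t_visits)                 # learning rate for episodic Q-learning
        bonus = c * math.sqrt(H**3 * math.log(S * A * H * K) / t_visits)
        Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + V[h + 1, s_next] + bonus)
        V[h, s] = min(float(H), Q[h, s].max())           # clip the optimistic value at H
        total += r
        s = s_next

print(f"average per-episode return: {total / K:.3f}")
```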
This list is automatically generated from the titles and abstracts of the papers in this site.