Whittle Index Learning Algorithms for Restless Bandits with Constant Stepsizes
- URL: http://arxiv.org/abs/2409.04605v1
- Date: Fri, 6 Sep 2024 20:24:19 GMT
- Title: Whittle Index Learning Algorithms for Restless Bandits with Constant Stepsizes
- Authors: Vishesh Mittal, Rahul Meshram, Surya Prakash,
- Abstract summary: We study the Whittle index learning algorithm for restless multi-armed bandits.
We first present Q-learning algorithm with exploration policies -- epsilon-greedy, softmax, epsilon-softmax with constant stepsizes.
We extend the study of Q-learning to index learning for single-armed restless bandit.
- Score: 3.3918638314432945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the Whittle index learning algorithm for restless multi-armed bandits. We consider index learning algorithm with Q-learning. We first present Q-learning algorithm with exploration policies -- epsilon-greedy, softmax, epsilon-softmax with constant stepsizes. We extend the study of Q-learning to index learning for single-armed restless bandit. The algorithm of index learning is two-timescale variant of stochastic approximation, on slower timescale we update index learning scheme and on faster timescale we update Q-learning assuming fixed index value. In Q-learning updates are in asynchronous manner. We study constant stepsizes two timescale stochastic approximation algorithm. We provide analysis of two-timescale stochastic approximation for index learning with constant stepsizes. Further, we present study on index learning with deep Q-network (DQN) learning and linear function approximation with state-aggregation method. We describe the performance of our algorithms using numerical examples. We have shown that index learning with Q learning, DQN and function approximations learns the Whittle index.
Related papers
- Model-Based Learning of Whittle indices [5.830619388189558]
We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating and unichain Markov Decision Process (MDP)<n>BLINQ significantly outperforms existing Q-learning approaches in terms of the number of samples needed to get an accurate approximation.<n>It has a total computational cost even lower than Q-learning for any reasonably high number of samples.
arXiv Detail & Related papers (2025-11-25T15:21:00Z) - Faster Q-Learning Algorithms for Restless Bandits [0.0]
We study the Whittle index learning algorithm for restless multi-armed bandits (RMAB)
We first present Q-learning algorithm and its variants -- speedy Q-learning (RMAB), generalized speedy Q-learning (G) and phase Q-learning (PhaseQL)
arXiv Detail & Related papers (2024-09-06T20:55:07Z) - Two-Step Q-Learning [0.0]
The paper proposes a novel off-policy two-step Q-learning algorithm, without importance sampling.
Numerical experiments demonstrate the superior performance of both the two-step Q-learning and its smooth variants.
arXiv Detail & Related papers (2024-07-02T15:39:00Z) - Tabular and Deep Learning for the Whittle Index [0.2749898166276853]
We present QWI and QWINN, two reinforcement learning algorithms for learning the Whittle index for the total discounted criterion.
In our main theoretical result we show that QWI converges to the real Whittle indices.
For QWINN, we show that all local minima of the Bellman error are locally stable equilibria.
Numerical computations show that QWI and QWINN converge faster than the standard Q -learning algorithm.
arXiv Detail & Related papers (2024-06-04T07:41:15Z) - q-Learning in Continuous Time [1.4213973379473654]
We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation.
We develop a q-learning" theory around the q-function that is independent of time discretization.
We devise different actor-critic algorithms for solving underlying problems.
arXiv Detail & Related papers (2022-07-02T02:20:41Z) - Simple Stochastic and Online Gradient DescentAlgorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair instances.
Online descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple and online descent to methods for pairwise learning.
arXiv Detail & Related papers (2021-11-23T18:10:48Z) - Evolving Reinforcement Learning Algorithms [186.62294652057062]
We propose a method for meta-learning reinforcement learning algorithms.
The learned algorithms are domain-agnostic and can generalize to new environments not seen during training.
We highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld type tasks, and Atari games.
arXiv Detail & Related papers (2021-01-08T18:55:07Z) - Finite-Time Analysis for Double Q-learning [50.50058000948908]
We provide the first non-asymptotic, finite-time analysis for double Q-learning.
We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $epsilon$-accurate neighborhood of the global optimum.
arXiv Detail & Related papers (2020-09-29T18:48:21Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z) - Momentum Q-learning with Finite-Sample Convergence Guarantee [49.38471009162477]
This paper analyzes a class of momentum-based Q-learning algorithms with finite-sample guarantee.
We establish the convergence guarantee for MomentumQ with linear function approximations and Markovian sampling.
We demonstrate through various experiments that the proposed MomentumQ outperforms other momentum-based Q-learning algorithms.
arXiv Detail & Related papers (2020-07-30T12:27:03Z) - AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of textit"knowledge gain" and textit"mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS)
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z) - Statistical Adaptive Stochastic Gradient Methods [34.859895010071234]
We propose a statistical adaptive procedure called SALSA for automatically scheduling the learning rate (step size) in gradient methods.
SALSA first uses a smoothed line-search procedure to gradually increase the learning rate, then automatically decreases the learning rate.
The method for decreasing the learning rate is based on a new statistical test for detecting station switches when using a constant step size.
arXiv Detail & Related papers (2020-02-25T00:04:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.