Optimistic Whittle Index Policy: Online Learning for Restless Bandits
- URL: http://arxiv.org/abs/2205.15372v1
- Date: Mon, 30 May 2022 18:32:20 GMT
- Title: Optimistic Whittle Index Policy: Online Learning for Restless Bandits
- Authors: Kai Wang, Lily Xu, Aparna Taneja, Milind Tambe
- Abstract summary: We propose the first online learning algorithm based on the Whittle index policy, using upper confidence bounds to learn transition dynamics.
Our algorithm, UCWhittle, achieves sublinear $O(\sqrt{T \log T})$ frequentist regret to solve RMABs with unknown transitions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Restless multi-armed bandits (RMABs) extend multi-armed bandits to allow for
stateful arms, where the state of each arm evolves restlessly with different
transitions depending on whether that arm is pulled. However, solving RMABs
requires information on transition dynamics, which is often not available
upfront. To plan in RMAB settings with unknown transitions, we propose the
first online learning algorithm based on the Whittle index policy, using an
upper confidence bound (UCB) approach to learn transition dynamics.
Specifically, we formulate a bilinear program to compute the optimistic Whittle
index from the confidence bounds in transition dynamics. Our algorithm,
UCWhittle, achieves sublinear $O(\sqrt{T \log T})$ frequentist regret to solve
RMABs with unknown transitions. Empirically, we demonstrate that UCWhittle
leverages the structure of RMABs and the Whittle index policy solution to
achieve better performance than existing online learning baselines across three
domains, including on real-world maternal and childcare data aimed at reducing
maternal mortality.
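The two computational ingredients the abstract names can be made concrete. Below is a minimal Python sketch, not the authors' implementation: `whittle_index` computes the Whittle index of a 2-state, 2-action arm by bisection on the passive subsidy, and `optimistic_whittle_index` maximizes that index over a UCB-style confidence box around estimated transition probabilities. UCWhittle solves this inner maximization as a bilinear program; the corner enumeration here is a brute-force stand-in, and the state-as-reward model, the discount factor, and all function names are illustrative assumptions.

```python
from itertools import product
import numpy as np

def whittle_index(p, s, beta=0.95, lo=-2.0, hi=2.0, iters=50):
    """Whittle index of state `s` for a 2-state, 2-action arm.

    p[s, a] = P(next state = 1 | state s, action a), with a = 0 passive
    and a = 1 active; the reward is the state itself.  The index is the
    passive subsidy at which both actions are equally good in state `s`,
    found by bisection (indexability is assumed, as in the paper).
    """
    def q_values(lam):
        v = np.zeros(2)                        # value iteration on the
        for _ in range(1000):                  # lambda-subsidized MDP
            q = np.empty((2, 2))
            for st in (0, 1):
                for a in (0, 1):
                    nxt = p[st, a] * v[1] + (1.0 - p[st, a]) * v[0]
                    q[st, a] = st + (lam if a == 0 else 0.0) + beta * nxt
            v_new = q.max(axis=1)
            if np.abs(v_new - v).max() < 1e-8:
                return q
            v = v_new
        return q

    for _ in range(iters):                     # bisect on the subsidy
        mid = 0.5 * (lo + hi)
        q = q_values(mid)
        if q[s, 1] > q[s, 0]:                  # acting still wins: raise it
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def optimistic_whittle_index(p_hat, radius, s):
    """Largest index consistent with the box p_hat +/- radius.

    UCWhittle poses this as a bilinear program; enumerating the 2^4
    corners of the box is a brute-force stand-in for illustration.
    """
    best = -np.inf
    for signs in product((-1.0, 1.0), repeat=4):
        p = np.clip(p_hat + radius * np.reshape(signs, (2, 2)), 0.0, 1.0)
        best = max(best, whittle_index(p, s))
    return best
```

In an online loop, the confidence radius for each (state, action) pair would shrink with its visit count, e.g. proportional to $\sqrt{\log t / n(s,a)}$, and the planner would pull the budget-many arms with the highest optimistic indices; that optimism-under-uncertainty structure is what underlies the stated $O(\sqrt{T \log T})$ regret bound.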
Related papers
- GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits [16.054685587034836]
GINO-Q is a three-timescale stochastic approximation algorithm designed to learn an asymptotically optimal index policy for restless multi-armed bandits (RMABs).
GINO-Q does not require RMABs to be indexable, enhancing its flexibility and applicability.
Our experimental results demonstrate that GINO-Q consistently learns near-optimal policies, even for non-indexable RMABs.
arXiv Detail & Related papers (2024-08-19T10:50:45Z) - A Federated Online Restless Bandit Framework for Cooperative Resource Allocation [23.698976872351576]
We study the cooperative resource allocation problem with unknown system dynamics of Markov reward processes (MRPs).
We put forth a Federated Thompson sampling-enabled Whittle Index (FedTSWI) algorithm to solve this multi-agent online RMAB problem; a minimal single-agent sketch of this Thompson-sampling pattern appears after this list.
Numerical results show that the proposed algorithm achieves a fast convergence rate of $\mathcal{O}(\sqrt{T\log(T)})$ and better performance compared with baselines.
arXiv Detail & Related papers (2024-06-12T08:34:53Z) - Decision Mamba: A Multi-Grained State Space Model with Self-Evolution Regularization for Offline RL [57.202733701029594]
Decision Mamba is a novel multi-grained state space model with a self-evolving policy learning strategy.
To mitigate the overfitting issue on noisy trajectories, a self-evolving policy is proposed by using progressive regularization.
The policy evolves by using its own past knowledge to refine the suboptimal actions, thus enhancing its robustness on noisy demonstrations.
arXiv Detail & Related papers (2024-06-08T10:12:00Z) - Combinatorial Multivariant Multi-Armed Bandits with Applications to Episodic Reinforcement Learning and Beyond [58.39457881271146]
We introduce a novel framework of combinatorial multi-armed bandits (CMAB) with multivariant and probabilistically triggering arms (CMAB-MT).
Compared with existing CMAB works, CMAB-MT not only enhances the modeling power but also allows improved results by leveraging distinct statistical properties for multivariant random variables.
Our framework can include many important problems as applications, such as episodic reinforcement learning (RL) and probabilistic maximum coverage for goods distribution.
arXiv Detail & Related papers (2024-06-03T14:48:53Z) - Meta-Learning Adversarial Bandit Algorithms [55.72892209124227]
We study online meta-learning with bandit feedback.
We learn to tune a generalization of online mirror descent (OMD) with self-concordant barrier regularizers.
arXiv Detail & Related papers (2023-07-05T13:52:10Z) - Meta-Learning Adversarial Bandits [49.094361442409785]
We study online learning with bandit feedback across multiple tasks, with the goal of improving average performance across tasks if they are similar according to some natural task-similarity measure.
As the first to target the adversarial setting, we design a meta-algorithm that yields setting-specific guarantees for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO).
Our guarantees rely on proving that unregularized follow-the-leader combined with multiplicative weights is enough to online learn a non-smooth and non-convex sequence of affine functions of Bregman divergences that upper-bound the regret of OMD.
arXiv Detail & Related papers (2022-05-27T17:40:32Z) - Scalable Decision-Focused Learning in Restless Multi-Armed Bandits with Application to Maternal and Child Health [36.442133189056136]
This paper studies restless multi-armed bandit (RMAB) problems with unknown arm transition dynamics but with known correlated arm features.
The goal is to learn a model to predict transition dynamics given features, where the Whittle index policy solves the RMAB problems using predicted transitions.
However, prior works typically train the predictive model to maximize accuracy rather than final RMAB solution quality, creating a mismatch between the training and evaluation objectives.
To address this shortcoming, we propose a novel approach for decision-focused learning in RMAB that directly trains the predictive model to maximize the Whittle index solution quality.
arXiv Detail & Related papers (2022-02-02T08:36:10Z) - Robust Restless Bandits: Tackling Interval Uncertainty with Deep Reinforcement Learning [31.515757763077065]
We introduce Robust Restless Bandits, a generalization of restless multi-armed bandits (RMABs).
We develop solutions for a minimax regret objective when transitions are given by interval uncertainties.
We introduce RMABPPO, a novel deep reinforcement learning algorithm for solving RMABs.
arXiv Detail & Related papers (2021-07-04T17:21:26Z) - Q-Learning Lagrange Policies for Multi-Action Restless Bandits [35.022322303796216]
Multi-action restless multi-armed bandits (RMABs) are a powerful framework for constrained resource allocation in which $N$ independent processes are managed.
We design the first algorithms for learning good policies for Multi-action RMABs online using combinations of Lagrangian relaxation and Q-learning.
arXiv Detail & Related papers (2021-06-22T19:20:09Z) - Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits [61.490254407420906]
We study the online restless bandit problem, where the state of each arm evolves according to a Markov chain.
We propose Restless-UCB, a learning policy that follows the explore-then-commit framework; a minimal sketch of this pattern appears after this list.
arXiv Detail & Related papers (2020-11-05T05:16:04Z) - Robust Inverse Reinforcement Learning under Transition Dynamics Mismatch [60.23815709215807]
We study the inverse reinforcement learning (IRL) problem under a transition dynamics mismatch between the expert and the learner.
We propose a robust MCE IRL algorithm, which is a principled approach to help with this mismatch.
arXiv Detail & Related papers (2020-07-02T14:57:13Z)
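As referenced in the FedTSWI entry above, here is a minimal single-agent sketch of Thompson sampling over transition dynamics combined with the Whittle index, reusing `whittle_index` from the first sketch: maintain Beta posteriors over each arm's transition probabilities, sample dynamics each round, and pull the arms with the highest indices under the sample. The federated averaging across agents in the paper is omitted, and the Beta posterior, top-k budget, and function names are illustrative assumptions.

```python
import numpy as np

def thompson_whittle_step(alpha, beta_, states, budget, rng):
    """One round of Thompson sampling over transition dynamics.

    alpha[n, s, a] and beta_[n, s, a] are Beta-posterior parameters for
    P(next state = 1 | s, a) of arm n; states[n] is arm n's current
    state.  Returns the `budget` arms to pull this round; the caller
    updates the posteriors from the observed transitions afterwards.
    Reuses whittle_index() from the sketch above.
    """
    n_arms = alpha.shape[0]
    indices = np.empty(n_arms)
    for n in range(n_arms):
        p_sample = rng.beta(alpha[n], beta_[n])   # sampled 2x2 dynamics
        indices[n] = whittle_index(p_sample, states[n])
    return np.argsort(indices)[-budget:]          # top-`budget` indices
```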
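And, as referenced in the Restless-UCB entry, a minimal sketch of the explore-then-commit pattern: estimate each arm's dynamics during a round-robin exploration phase, then commit to a fixed policy computed from the estimates. The hypothetical `env.pull` interface, the uniform exploration split, and the greedy commit policy are all illustrative simplifications; the paper commits to a more carefully constructed policy with an exploration length tuned to control regret.

```python
import numpy as np

def explore_then_commit(env, n_arms, horizon, explore_frac=0.2):
    """Explore-then-commit for an online restless bandit (illustrative).

    env.pull(arm) is assumed to return the observed transition
    (prev_state, next_state) of a 2-state arm; one arm per round.
    """
    counts = np.zeros((n_arms, 2, 2))            # transition counts

    t_explore = int(explore_frac * horizon)
    for t in range(t_explore):                   # phase 1: round-robin
        arm = t % n_arms
        prev, nxt = env.pull(arm)
        counts[arm, prev, nxt] += 1

    # Laplace-smoothed empirical transition matrices per arm.
    p_hat = (counts + 1.0) / (counts + 1.0).sum(axis=2, keepdims=True)

    # Phase 2: commit.  Greedy stand-in: always pull the arm with the
    # best chance of reaching the rewarding state from either state.
    arm = int(np.argmax(p_hat[:, :, 1].max(axis=1)))
    for t in range(t_explore, horizon):
        env.pull(arm)
```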
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.