Optimistic Whittle Index Policy: Online Learning for Restless Bandits
- URL: http://arxiv.org/abs/2205.15372v1
- Date: Mon, 30 May 2022 18:32:20 GMT
- Title: Optimistic Whittle Index Policy: Online Learning for Restless Bandits
- Authors: Kai Wang, Lily Xu, Aparna Taneja, Milind Tambe
- Abstract summary: We propose the first online learning algorithm based on the Whittle index policy, using upper confidence bounds to learn transition dynamics.
Our algorithm, UCWhittle, achieves sublinear $O(\sqrt{T \log T})$ frequentist regret to solve RMABs with unknown transitions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Restless multi-armed bandits (RMABs) extend multi-armed bandits to allow for
stateful arms, where the state of each arm evolves restlessly with different
transitions depending on whether that arm is pulled. However, solving RMABs
requires information on transition dynamics, which is often not available
upfront. To plan in RMAB settings with unknown transitions, we propose the
first online learning algorithm based on the Whittle index policy, using an
upper confidence bound (UCB) approach to learn transition dynamics.
Specifically, we formulate a bilinear program to compute the optimistic Whittle
index from the confidence bounds in transition dynamics. Our algorithm,
UCWhittle, achieves sublinear $O(\sqrt{T \log T})$ frequentist regret to solve
RMABs with unknown transitions. Empirically, we demonstrate that UCWhittle
leverages the structure of RMABs and the Whittle index policy solution to
achieve better performance than existing online learning baselines across three
domains, including on real-world maternal and childcare data aimed at reducing
maternal mortality.
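The two computational ingredients the abstract names can be made concrete. Below is a minimal Python sketch, not the authors' implementation: `whittle_index` computes the Whittle index of a 2-state, 2-action arm by bisection on the passive subsidy, and `optimistic_whittle_index` maximizes that index over a UCB-style confidence box around estimated transition probabilities. UCWhittle solves this inner maximization as a bilinear program; the corner enumeration here is a brute-force stand-in, and the state-as-reward model, the discount factor, and all function names are illustrative assumptions.

```python
from itertools import product
import numpy as np

def whittle_index(p, s, beta=0.95, lo=-2.0, hi=2.0, iters=50):
    """Whittle index of state `s` for a 2-state, 2-action arm.

    p[s, a] = P(next state = 1 | state s, action a), with a = 0 passive
    and a = 1 active; the reward is the state itself.  The index is the
    passive subsidy at which both actions are equally good in state `s`,
    found by bisection (indexability is assumed, as in the paper).
    """
    def q_values(lam):
        v = np.zeros(2)                        # value iteration on the
        for _ in range(1000):                  # lambda-subsidized MDP
            q = np.empty((2, 2))
            for st in (0, 1):
                for a in (0, 1):
                    nxt = p[st, a] * v[1] + (1.0 - p[st, a]) * v[0]
                    q[st, a] = st + (lam if a == 0 else 0.0) + beta * nxt
            v_new = q.max(axis=1)
            if np.abs(v_new - v).max() < 1e-8:
                return q
            v = v_new
        return q

    for _ in range(iters):                     # bisect on the subsidy
        mid = 0.5 * (lo + hi)
        q = q_values(mid)
        if q[s, 1] > q[s, 0]:                  # acting still wins: raise it
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def optimistic_whittle_index(p_hat, radius, s):
    """Largest index consistent with the box p_hat +/- radius.

    UCWhittle poses this as a bilinear program; enumerating the 2^4
    corners of the box is a brute-force stand-in for illustration.
    """
    best = -np.inf
    for signs in product((-1.0, 1.0), repeat=4):
        p = np.clip(p_hat + radius * np.reshape(signs, (2, 2)), 0.0, 1.0)
        best = max(best, whittle_index(p, s))
    return best
```

In an online loop, the confidence radius for each (state, action) pair would shrink with its visit count, e.g. proportional to $\sqrt{\log t / n(s,a)}$, and the planner would pull the budget-many arms with the highest optimistic indices; that optimism-under-uncertainty structure is what underlies the stated $O(\sqrt{T \log T})$ regret bound.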
Related papers
- GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits [16.054685587034836]
GINO-Q is a three-timescale stochastic approximation algorithm designed to learn an asymptotically optimal index policy for restless multi-armed bandits (RMABs).
GINO-Q does not require RMABs to be indexable, enhancing its flexibility and applicability.
Our experimental results demonstrate that GINO-Q consistently learns near-optimal policies, even for non-indexable RMABs.
arXiv Detail & Related papers (2024-08-19T10:50:45Z) - A Federated Online Restless Bandit Framework for Cooperative Resource Allocation [23.698976872351576]
We study the cooperative resource allocation problem with unknown system dynamics of Markov reward processes (MRPs).
We put forth a Federated Thompson sampling-enabled Whittle Index (FedTSWI) algorithm to solve this multi-agent online RMAB problem; a minimal single-agent sketch of this Thompson-sampling pattern appears after this list.
Numerical results show that the proposed algorithm achieves a fast convergence rate of $\mathcal{O}(\sqrt{T\log(T)})$ and better performance compared with baselines.
arXiv Detail & Related papers (2024-06-12T08:34:53Z) - Decision Mamba: A Multi-Grained State Space Model with Self-Evolution Regularization for Offline RL [57.202733701029594]
Decision Mamba is a novel multi-grained state space model with a self-evolving policy learning strategy.
To mitigate the overfitting issue on noisy trajectories, a self-evolving policy is proposed by using progressive regularization.
The policy evolves by using its own past knowledge to refine the suboptimal actions, thus enhancing its robustness on noisy demonstrations.
arXiv Detail & Related papers (2024-06-08T10:12:00Z) - Combinatorial Multivariant Multi-Armed Bandits with Applications to Episodic Reinforcement Learning and Beyond [58.39457881271146]
We introduce a novel framework of combinatorial multi-armed bandits (CMAB) with multivariant and probabilistically triggering arms (CMAB-MT).
Compared with existing CMAB works, CMAB-MT not only enhances the modeling power but also allows improved results by leveraging distinct statistical properties for multivariant random variables.
Our framework can include many important problems as applications, such as episodic reinforcement learning (RL) and probabilistic maximum coverage for goods distribution.
arXiv Detail & Related papers (2024-06-03T14:48:53Z) - Meta-Learning Adversarial Bandit Algorithms [55.72892209124227]
We study online meta-learning with bandit feedback.
We learn to tune a generalization of online mirror descent (OMD) with self-concordant barrier regularizers.
arXiv Detail & Related papers (2023-07-05T13:52:10Z) - Meta-Learning Adversarial Bandits [49.094361442409785]
We study online learning with bandit feedback across multiple tasks, with the goal of improving average performance across tasks if they are similar according to some natural task-similarity measure.
As the first to target the adversarial setting, we design a meta-algorithm that yields setting-specific guarantees for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO).
Our guarantees rely on proving that unregularized follow-the-leader combined with multiplicative weights is enough to online learn a non-smooth and non-convex sequence of affine functions of Bregman divergences that upper-bound the regret of OMD.
arXiv Detail & Related papers (2022-05-27T17:40:32Z) - Scalable Decision-Focused Learning in Restless Multi-Armed Bandits with Application to Maternal and Child Health [36.442133189056136]
This paper studies restless multi-armed bandit (RMAB) problems with unknown arm transition dynamics but with known correlated arm features.
The goal is to learn a model to predict transition dynamics given features, where the Whittle index policy solves the RMAB problems using predicted transitions.
However, prior works typically train the predictive model to maximize accuracy rather than final RMAB solution quality, creating a mismatch between the training and evaluation objectives.
To address this shortcoming, we propose a novel approach for decision-focused learning in RMAB that directly trains the predictive model to maximize the Whittle index solution quality.
arXiv Detail & Related papers (2022-02-02T08:36:10Z) - Robust Restless Bandits: Tackling Interval Uncertainty with Deep Reinforcement Learning [31.515757763077065]
We introduce Robust Restless Bandits, a generalization of restless multi-armed bandits (RMABs).
We develop solutions for a minimax regret objective when transitions are given by interval uncertainties.
We introduce RMABPPO, a novel deep reinforcement learning algorithm for solving RMABs.
arXiv Detail & Related papers (2021-07-04T17:21:26Z) - Q-Learning Lagrange Policies for Multi-Action Restless Bandits [35.022322303796216]
Multi-action restless multi-armed bandits (RMABs) are a powerful framework for constrained resource allocation in which $N$ independent processes are managed.
We design the first algorithms for learning good policies for Multi-action RMABs online using combinations of Lagrangian relaxation and Q-learning.
arXiv Detail & Related papers (2021-06-22T19:20:09Z) - Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits [61.490254407420906]
We study the online restless bandit problem, where the state of each arm evolves according to a Markov chain.
We propose Restless-UCB, a learning policy that follows the explore-then-commit framework; a minimal sketch of this pattern appears after this list.
arXiv Detail & Related papers (2020-11-05T05:16:04Z) - Robust Inverse Reinforcement Learning under Transition Dynamics Mismatch [60.23815709215807]
We study the inverse reinforcement learning (IRL) problem under a transition dynamics mismatch between the expert and the learner.
We propose a robust MCE IRL algorithm, which is a principled approach to help with this mismatch.
arXiv Detail & Related papers (2020-07-02T14:57:13Z)
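As referenced in the FedTSWI entry above, here is a minimal single-agent sketch of Thompson sampling over transition dynamics combined with the Whittle index, reusing `whittle_index` from the first sketch: maintain Beta posteriors over each arm's transition probabilities, sample dynamics each round, and pull the arms with the highest indices under the sample. The federated averaging across agents in the paper is omitted, and the Beta posterior, top-k budget, and function names are illustrative assumptions.

```python
import numpy as np

def thompson_whittle_step(alpha, beta_, states, budget, rng):
    """One round of Thompson sampling over transition dynamics.

    alpha[n, s, a] and beta_[n, s, a] are Beta-posterior parameters for
    P(next state = 1 | s, a) of arm n; states[n] is arm n's current
    state.  Returns the `budget` arms to pull this round; the caller
    updates the posteriors from the observed transitions afterwards.
    Reuses whittle_index() from the sketch above.
    """
    n_arms = alpha.shape[0]
    indices = np.empty(n_arms)
    for n in range(n_arms):
        p_sample = rng.beta(alpha[n], beta_[n])   # sampled 2x2 dynamics
        indices[n] = whittle_index(p_sample, states[n])
    return np.argsort(indices)[-budget:]          # top-`budget` indices
```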
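And, as referenced in the Restless-UCB entry, a minimal sketch of the explore-then-commit pattern: estimate each arm's dynamics during a round-robin exploration phase, then commit to a fixed policy computed from the estimates. The hypothetical `env.pull` interface, the uniform exploration split, and the greedy commit policy are all illustrative simplifications; the paper commits to a more carefully constructed policy with an exploration length tuned to control regret.

```python
import numpy as np

def explore_then_commit(env, n_arms, horizon, explore_frac=0.2):
    """Explore-then-commit for an online restless bandit (illustrative).

    env.pull(arm) is assumed to return the observed transition
    (prev_state, next_state) of a 2-state arm; one arm per round.
    """
    counts = np.zeros((n_arms, 2, 2))            # transition counts

    t_explore = int(explore_frac * horizon)
    for t in range(t_explore):                   # phase 1: round-robin
        arm = t % n_arms
        prev, nxt = env.pull(arm)
        counts[arm, prev, nxt] += 1

    # Laplace-smoothed empirical transition matrices per arm.
    p_hat = (counts + 1.0) / (counts + 1.0).sum(axis=2, keepdims=True)

    # Phase 2: commit.  Greedy stand-in: always pull the arm with the
    # best chance of reaching the rewarding state from either state.
    arm = int(np.argmax(p_hat[:, :, 1].max(axis=1)))
    for t in range(t_explore, horizon):
        env.pull(arm)
```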
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.