Indexability is Not Enough for Whittle: Improved, Near-Optimal
Algorithms for Restless Bandits
- URL: http://arxiv.org/abs/2211.00112v1
- Date: Mon, 31 Oct 2022 19:35:15 GMT
- Title: Indexability is Not Enough for Whittle: Improved, Near-Optimal
Algorithms for Restless Bandits
- Authors: Abheek Ghosh, Dheeraj Nagaraj, Manish Jain, Milind Tambe
- Abstract summary: We study the problem of planning restless multi-armed bandits (RMABs) with multiple actions.
We first show that Whittle index policies can fail in simple and practically relevant settings.
We then propose an alternate planning algorithm based on the mean-field method.
- Score: 30.532795983761314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of planning restless multi-armed bandits (RMABs) with
multiple actions. This is a popular model for multi-agent systems with
applications like multi-channel communication, monitoring and machine
maintenance tasks, and healthcare. Whittle index policies, which are based on
Lagrangian relaxations, are widely used in these settings due to their
simplicity and near-optimality under certain conditions. In this work, we first
show that Whittle index policies can fail in simple and practically relevant
RMAB settings, even when the RMABs are indexable. We discuss why the
optimality guarantees fail and why asymptotic optimality may not translate well
to practically relevant planning horizons.
We then propose an alternate planning algorithm based on the mean-field
method, which can provably and efficiently obtain near-optimal policies with a
large number of arms, without the stringent structural assumptions required by
the Whittle index policies. This borrows ideas from existing research with some
improvements: our approach is hyper-parameter free, and we provide an improved
non-asymptotic analysis which has: (a) no requirement for exogenous
hyper-parameters and tighter polynomial dependence on known problem parameters;
(b) high probability bounds which show that the reward of the policy is
reliable; and (c) matching sub-optimality lower bounds for this algorithm with
respect to the number of arms, thus demonstrating the tightness of our bounds.
Our extensive experimental analysis shows that the mean-field approach matches
or outperforms other baselines.
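For context, below is a minimal sketch of the kind of Whittle-index policy the abstract critiques: assuming a binary-action RMAB in which a per-state index has already been computed for every arm (hypothetical values here), the policy simply activates the arms with the largest current indices, up to the budget. This is only an illustrative baseline, not the mean-field planner proposed in the paper.

```python
import numpy as np

def whittle_policy_step(indices, states, budget):
    """One step of a generic Whittle-style index policy (illustrative sketch).

    indices : array of shape (num_arms, num_states); indices[i, s] is a
              precomputed index for arm i in state s (hypothetical values).
    states  : int array of shape (num_arms,) with each arm's current state.
    budget  : number of arms that may be activated this step.

    Returns a binary action vector: 1 = activate, 0 = passive.
    """
    num_arms = len(states)
    # Look up each arm's index at its current state.
    current_indices = indices[np.arange(num_arms), states]
    # Activate the `budget` arms with the largest indices.
    top = np.argsort(current_indices)[::-1][:budget]
    actions = np.zeros(num_arms, dtype=int)
    actions[top] = 1
    return actions
```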
Related papers
- GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits [16.054685587034836]
GINO-Q is a three-timescale approximation algorithm designed to learn an asymptotically optimal index policy for restless multi-armed bandits (RMABs).
GINO-Q does not require RMABs to be indexable, enhancing its flexibility and applicability.
Our experimental results demonstrate that GINO-Q consistently learns near-optimal policies, even for non-indexable RMABs.
arXiv Detail & Related papers (2024-08-19T10:50:45Z) - Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs [9.58750210024265]
We consider (stochastic) softmax policy gradient (PG) methods for bandits and Markov decision processes (MDPs).
We show that the proposed algorithm offers theoretical guarantees similar to state-of-the-art results, but does not require knowledge of oracle-like quantities.
For the multi-armed bandit setting, our techniques result in a theoretically-principled PG algorithm that does not require explicit exploration, the knowledge of the reward gap, the reward distributions, or the noise.
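As a rough illustration of this setting, here is a vanilla stochastic softmax policy-gradient loop for a multi-armed bandit using the standard REINFORCE estimator; the paper's algorithm adds the machinery needed to avoid oracle-like quantities, which is not reproduced here. The `pull_arm` callback and the fixed step size are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(theta):
    z = theta - theta.max()        # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def softmax_pg_bandit(pull_arm, num_arms, steps=10_000, lr=0.1):
    """Stochastic softmax policy gradient for a multi-armed bandit (sketch).

    pull_arm(a) -> sampled reward of arm a (environment callback, assumed).
    r * (one_hot(a) - pi) is an unbiased estimate of the gradient of the
    expected reward under the softmax policy.
    """
    theta = np.zeros(num_arms)
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(num_arms, p=pi)
        r = pull_arm(a)
        grad = -r * pi
        grad[a] += r
        theta += lr * grad
    return softmax(theta)

# Example: three Bernoulli arms with unknown means.
means = [0.2, 0.5, 0.8]
policy = softmax_pg_bandit(lambda a: float(rng.random() < means[a]), num_arms=3)
```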
arXiv Detail & Related papers (2024-05-21T18:12:39Z) - Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps between the rewards of various arms.
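For reference, here is a sketch of the classical successive-rejects baseline for fixed-budget best arm identification, which rejects one empirically worst arm at the end of each phase; the paper's truly adaptive, any-round rejection rule based on large deviations is more refined and is not reproduced. The `pull_arm` callback is an assumption.

```python
import numpy as np

def successive_rejects(pull_arm, num_arms, budget):
    """Classical successive-rejects baseline for fixed-budget BAI (sketch).

    pull_arm(a) -> sampled reward of arm a (environment callback, assumed).
    One arm is rejected per phase based on empirical means; the surviving
    arm is returned as the best-arm guess.
    """
    log_bar = 0.5 + sum(1.0 / i for i in range(2, num_arms + 1))
    active = list(range(num_arms))
    sums = np.zeros(num_arms)
    counts = np.zeros(num_arms, dtype=int)
    n_prev = 0
    for k in range(1, num_arms):
        # Phase lengths follow the standard successive-rejects schedule.
        n_k = int(np.ceil((budget - num_arms) / (log_bar * (num_arms + 1 - k))))
        for a in active:
            for _ in range(n_k - n_prev):
                sums[a] += pull_arm(a)
                counts[a] += 1
        n_prev = n_k
        means = {a: sums[a] / max(counts[a], 1) for a in active}
        active.remove(min(active, key=means.get))  # reject the empirically worst arm
    return active[0]
```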
arXiv Detail & Related papers (2023-12-19T13:17:43Z) - Optimizing Solution-Samplers for Combinatorial Problems: The Landscape
of Policy-Gradient Methods [52.0617030129699]
We introduce a novel theoretical framework for analyzing the effectiveness of DeepMatching Networks and Reinforcement Learning methods.
Our main contribution holds for a broad class of problems including Max- and Min-Cut, Max-$k$-Bipartite-Bi, Maximum-Weight-Bipartite-Bi, and the Traveling Salesman Problem.
As a byproduct of our analysis we introduce a novel regularization process over vanilla descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.
arXiv Detail & Related papers (2023-10-08T23:39:38Z) - Thompson Exploration with Best Challenger Rule in Best Arm
Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
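As a minimal sketch of the Thompson-exploration component only, the loop below runs Thompson sampling for Bernoulli arms with Beta posteriors; the best challenger rule and the fixed-confidence stopping condition from the paper are omitted. The `pull_arm` callback is an assumption.

```python
import numpy as np

def thompson_bernoulli(pull_arm, num_arms, steps=5000, rng=np.random.default_rng(0)):
    """Thompson sampling with Beta(1, 1) priors for Bernoulli arms (sketch).

    pull_arm(a) -> 0/1 reward of arm a (environment callback, assumed).
    """
    alpha = np.ones(num_arms)   # posterior successes + 1
    beta = np.ones(num_arms)    # posterior failures + 1
    for _ in range(steps):
        samples = rng.beta(alpha, beta)   # one posterior sample per arm
        a = int(np.argmax(samples))       # play the arm with the best sample
        r = pull_arm(a)
        alpha[a] += r
        beta[a] += 1 - r
    return int(np.argmax(alpha / (alpha + beta)))  # empirical best-arm guess
```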
arXiv Detail & Related papers (2023-10-01T01:37:02Z) - Online Continuous Hyperparameter Optimization for Generalized Linear Contextual Bandits [55.03293214439741]
In contextual bandits, an agent sequentially makes actions from a time-dependent action set based on past experience.
We propose the first online continuous hyperparameter tuning framework for contextual bandits.
We show that it achieves sublinear regret in theory and performs consistently better than all existing methods on both synthetic and real datasets.
arXiv Detail & Related papers (2023-02-18T23:31:20Z) - Near-optimality for infinite-horizon restless bandits with many arms [19.12759228067286]
Restless bandits are problems with applications in recommender systems, active learning, revenue management and other areas.
We derive a class of policies, called fluid-balance policies, that have a $O(\sqrt{N})$ optimality gap.
We also demonstrate empirically that fluid-balance policies provide state-of-the-art performance on specific problems.
arXiv Detail & Related papers (2022-03-29T18:49:21Z) - Reinforcement Learning for Finite-Horizon Restless Multi-Armed
Multi-Action Bandits [8.136957953239254]
We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)^2B.
The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken.
Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy.
arXiv Detail & Related papers (2021-09-20T21:40:12Z) - On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
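As a hedged illustration of what a confidence-adjusted index looks like in the batch setting: an empirical mean plus a signed multiple of a confidence width, where the sign of `alpha` selects optimistic (UCB-like), pessimistic (LCB-like), or neutral behaviour. The Hoeffding-style width below is an illustrative choice, not the paper's exact construction.

```python
import numpy as np

def confidence_adjusted_choice(means, counts, alpha, delta=0.05):
    """Pick an arm from batch data with a confidence-adjusted index (sketch).

    index_a = empirical mean + alpha * confidence width;
    alpha > 0 is optimistic, alpha < 0 pessimistic, alpha = 0 greedy.
    """
    widths = np.sqrt(np.log(2.0 / delta) / (2.0 * np.maximum(counts, 1)))
    return int(np.argmax(np.asarray(means) + alpha * widths))
```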
arXiv Detail & Related papers (2021-04-06T05:23:20Z) - An Asymptotically Optimal Primal-Dual Incremental Algorithm for
Contextual Linear Bandits [129.1029690825929]
We introduce a novel algorithm improving over the state-of-the-art along multiple dimensions.
We establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits.
arXiv Detail & Related papers (2020-10-23T09:12:47Z) - Asymptotic Randomised Control with applications to bandits [0.0]
We consider a general multi-armed bandit problem with correlated elements as a relaxed control problem.
By introducing an entropy regularisation, we obtain a smooth approximation to the value function.
This yields a novel semi-index approximation of the optimal decision process.
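As a small illustration of how entropy regularisation smooths a value function, the log-sum-exp below converges to the hard maximum of the action values as the temperature goes to zero; this shows the general smoothing idea only, not the paper's relaxed-control construction.

```python
import numpy as np

def smooth_value(q_values, tau):
    """Entropy-regularized (log-sum-exp) smoothing of max_a Q(a) (sketch).

    V_tau = tau * log(sum_a exp(Q(a) / tau)) recovers max_a Q(a) as tau -> 0.
    """
    q = np.asarray(q_values, dtype=float)
    m = q.max()                              # shift for numerical stability
    return m + tau * np.log(np.exp((q - m) / tau).sum())

# Example: smooth maximum of three action values.
print(smooth_value([1.0, 2.0, 3.0], tau=0.1))   # close to 3.0
```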
arXiv Detail & Related papers (2020-10-14T17:17:48Z)