Indexability is Not Enough for Whittle: Improved, Near-Optimal
Algorithms for Restless Bandits
- URL: http://arxiv.org/abs/2211.00112v1
- Date: Mon, 31 Oct 2022 19:35:15 GMT
- Title: Indexability is Not Enough for Whittle: Improved, Near-Optimal
Algorithms for Restless Bandits
- Authors: Abheek Ghosh, Dheeraj Nagaraj, Manish Jain, Milind Tambe
- Abstract summary: We study the problem of planning restless multi-armed bandits (RMABs) with multiple actions.
We first show that Whittle index policies can fail in simple and practically relevant settings.
We then propose an alternate planning algorithm based on the mean-field method.
- Score: 30.532795983761314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of planning restless multi-armed bandits (RMABs) with
multiple actions. This is a popular model for multi-agent systems with
applications like multi-channel communication, monitoring and machine
maintenance tasks, and healthcare. Whittle index policies, which are based on
Lagrangian relaxations, are widely used in these settings due to their
simplicity and near-optimality under certain conditions. In this work, we first
show that Whittle index policies can fail in simple and practically relevant
RMAB settings, even when the RMABs are indexable. We discuss why the
optimality guarantees fail and why asymptotic optimality may not translate well
to practically relevant planning horizons.
We then propose an alternate planning algorithm based on the mean-field
method, which can provably and efficiently obtain near-optimal policies with a
large number of arms, without the stringent structural assumptions required by
the Whittle index policies. This borrows ideas from existing research with some
improvements: our approach is hyper-parameter free, and we provide an improved
non-asymptotic analysis which has: (a) no requirement for exogenous
hyper-parameters and tighter polynomial dependence on known problem parameters;
(b) high probability bounds which show that the reward of the policy is
reliable; and (c) matching sub-optimality lower bounds for this algorithm with
respect to the number of arms, thus demonstrating the tightness of our bounds.
Our extensive experimental analysis shows that the mean-field approach matches
or outperforms other baselines.
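For context, below is a minimal sketch of the kind of Whittle-index policy the abstract critiques: assuming a binary-action RMAB in which a per-state index has already been computed for every arm (hypothetical values here), the policy simply activates the arms with the largest current indices, up to the budget. This is only an illustrative baseline, not the mean-field planner proposed in the paper.

```python
import numpy as np

def whittle_policy_step(indices, states, budget):
    """One step of a generic Whittle-style index policy (illustrative sketch).

    indices : array of shape (num_arms, num_states); indices[i, s] is a
              precomputed index for arm i in state s (hypothetical values).
    states  : int array of shape (num_arms,) with each arm's current state.
    budget  : number of arms that may be activated this step.

    Returns a binary action vector: 1 = activate, 0 = passive.
    """
    num_arms = len(states)
    # Look up each arm's index at its current state.
    current_indices = indices[np.arange(num_arms), states]
    # Activate the `budget` arms with the largest indices.
    top = np.argsort(current_indices)[::-1][:budget]
    actions = np.zeros(num_arms, dtype=int)
    actions[top] = 1
    return actions
```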
Related papers
- GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits [16.054685587034836]
GINO-Q is a three-timescale approximation algorithm designed to learn an asymptotically optimal index policy for restless multi-armed bandits (RMABs).
GINO-Q does not require RMABs to be indexable, enhancing its flexibility and applicability.
Our experimental results demonstrate that GINO-Q consistently learns near-optimal policies, even for non-indexable RMABs.
arXiv Detail & Related papers (2024-08-19T10:50:45Z) - Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs [9.58750210024265]
We consider (stochastic) softmax policy gradient (PG) methods for bandits and Markov decision processes (MDPs).
We show that the proposed algorithm offers theoretical guarantees similar to state-of-the-art results, but does not require knowledge of oracle-like quantities.
For the multi-armed bandit setting, our techniques result in a theoretically-principled PG algorithm that does not require explicit exploration, the knowledge of the reward gap, the reward distributions, or the noise.
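As a rough illustration of this setting, here is a vanilla stochastic softmax policy-gradient loop for a multi-armed bandit using the standard REINFORCE estimator; the paper's algorithm adds the machinery needed to avoid oracle-like quantities, which is not reproduced here. The `pull_arm` callback and the fixed step size are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(theta):
    z = theta - theta.max()        # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def softmax_pg_bandit(pull_arm, num_arms, steps=10_000, lr=0.1):
    """Stochastic softmax policy gradient for a multi-armed bandit (sketch).

    pull_arm(a) -> sampled reward of arm a (environment callback, assumed).
    r * (one_hot(a) - pi) is an unbiased estimate of the gradient of the
    expected reward under the softmax policy.
    """
    theta = np.zeros(num_arms)
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(num_arms, p=pi)
        r = pull_arm(a)
        grad = -r * pi
        grad[a] += r
        theta += lr * grad
    return softmax(theta)

# Example: three Bernoulli arms with unknown means.
means = [0.2, 0.5, 0.8]
policy = softmax_pg_bandit(lambda a: float(rng.random() < means[a]), num_arms=3)
```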
arXiv Detail & Related papers (2024-05-21T18:12:39Z) - Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps between the rewards of various arms.
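For reference, here is a sketch of the classical successive-rejects baseline for fixed-budget best arm identification, which rejects one empirically worst arm at the end of each phase; the paper's truly adaptive, any-round rejection rule based on large deviations is more refined and is not reproduced. The `pull_arm` callback is an assumption.

```python
import numpy as np

def successive_rejects(pull_arm, num_arms, budget):
    """Classical successive-rejects baseline for fixed-budget BAI (sketch).

    pull_arm(a) -> sampled reward of arm a (environment callback, assumed).
    One arm is rejected per phase based on empirical means; the surviving
    arm is returned as the best-arm guess.
    """
    log_bar = 0.5 + sum(1.0 / i for i in range(2, num_arms + 1))
    active = list(range(num_arms))
    sums = np.zeros(num_arms)
    counts = np.zeros(num_arms, dtype=int)
    n_prev = 0
    for k in range(1, num_arms):
        # Phase lengths follow the standard successive-rejects schedule.
        n_k = int(np.ceil((budget - num_arms) / (log_bar * (num_arms + 1 - k))))
        for a in active:
            for _ in range(n_k - n_prev):
                sums[a] += pull_arm(a)
                counts[a] += 1
        n_prev = n_k
        means = {a: sums[a] / max(counts[a], 1) for a in active}
        active.remove(min(active, key=means.get))  # reject the empirically worst arm
    return active[0]
```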
arXiv Detail & Related papers (2023-12-19T13:17:43Z) - Optimizing Solution-Samplers for Combinatorial Problems: The Landscape
of Policy-Gradient Methods [52.0617030129699]
We introduce a novel theoretical framework for analyzing the effectiveness of DeepMatching Networks and Reinforcement Learning methods.
Our main contribution holds for a broad class of problems including Max- and Min-Cut, Max-$k$-Bipartite-Bi, Maximum-Weight-Bipartite-Bi, and the Traveling Salesman Problem.
As a byproduct of our analysis we introduce a novel regularization process over vanilla descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.
arXiv Detail & Related papers (2023-10-08T23:39:38Z) - Thompson Exploration with Best Challenger Rule in Best Arm
Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
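As a minimal sketch of the Thompson-exploration component only, the loop below runs Thompson sampling for Bernoulli arms with Beta posteriors; the best challenger rule and the fixed-confidence stopping condition from the paper are omitted. The `pull_arm` callback is an assumption.

```python
import numpy as np

def thompson_bernoulli(pull_arm, num_arms, steps=5000, rng=np.random.default_rng(0)):
    """Thompson sampling with Beta(1, 1) priors for Bernoulli arms (sketch).

    pull_arm(a) -> 0/1 reward of arm a (environment callback, assumed).
    """
    alpha = np.ones(num_arms)   # posterior successes + 1
    beta = np.ones(num_arms)    # posterior failures + 1
    for _ in range(steps):
        samples = rng.beta(alpha, beta)   # one posterior sample per arm
        a = int(np.argmax(samples))       # play the arm with the best sample
        r = pull_arm(a)
        alpha[a] += r
        beta[a] += 1 - r
    return int(np.argmax(alpha / (alpha + beta)))  # empirical best-arm guess
```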
arXiv Detail & Related papers (2023-10-01T01:37:02Z) - Online Continuous Hyperparameter Optimization for Generalized Linear Contextual Bandits [55.03293214439741]
In contextual bandits, an agent sequentially makes actions from a time-dependent action set based on past experience.
We propose the first online continuous hyperparameter tuning framework for contextual bandits.
We show that it achieves sublinear regret in theory and performs consistently better than all existing methods on both synthetic and real datasets.
arXiv Detail & Related papers (2023-02-18T23:31:20Z) - Near-optimality for infinite-horizon restless bandits with many arms [19.12759228067286]
Restless bandits are problems with applications in recommender systems, active learning, revenue management and other areas.
We derive a class of policies, called fluid-balance policies, that have a $O(\sqrt{N})$ optimality gap.
We also demonstrate empirically that fluid-balance policies provide state-of-the-art performance on specific problems.
arXiv Detail & Related papers (2022-03-29T18:49:21Z) - Reinforcement Learning for Finite-Horizon Restless Multi-Armed
Multi-Action Bandits [8.136957953239254]
We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)^2B.
The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken.
Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy.
arXiv Detail & Related papers (2021-09-20T21:40:12Z) - On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
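As a hedged illustration of what a confidence-adjusted index looks like in the batch setting: an empirical mean plus a signed multiple of a confidence width, where the sign of `alpha` selects optimistic (UCB-like), pessimistic (LCB-like), or neutral behaviour. The Hoeffding-style width below is an illustrative choice, not the paper's exact construction.

```python
import numpy as np

def confidence_adjusted_choice(means, counts, alpha, delta=0.05):
    """Pick an arm from batch data with a confidence-adjusted index (sketch).

    index_a = empirical mean + alpha * confidence width;
    alpha > 0 is optimistic, alpha < 0 pessimistic, alpha = 0 greedy.
    """
    widths = np.sqrt(np.log(2.0 / delta) / (2.0 * np.maximum(counts, 1)))
    return int(np.argmax(np.asarray(means) + alpha * widths))
```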
arXiv Detail & Related papers (2021-04-06T05:23:20Z) - An Asymptotically Optimal Primal-Dual Incremental Algorithm for
Contextual Linear Bandits [129.1029690825929]
We introduce a novel algorithm improving over the state-of-the-art along multiple dimensions.
We establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits.
arXiv Detail & Related papers (2020-10-23T09:12:47Z) - Asymptotic Randomised Control with applications to bandits [0.0]
We consider a general multi-armed bandit problem with correlated elements as a relaxed control problem.
By introducing an entropy regularisation, we obtain a smooth approximation to the value function.
This yields a novel semi-index approximation of the optimal decision process.
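As a small illustration of how entropy regularisation smooths a value function, the log-sum-exp below converges to the hard maximum of the action values as the temperature goes to zero; this shows the general smoothing idea only, not the paper's relaxed-control construction.

```python
import numpy as np

def smooth_value(q_values, tau):
    """Entropy-regularized (log-sum-exp) smoothing of max_a Q(a) (sketch).

    V_tau = tau * log(sum_a exp(Q(a) / tau)) recovers max_a Q(a) as tau -> 0.
    """
    q = np.asarray(q_values, dtype=float)
    m = q.max()                              # shift for numerical stability
    return m + tau * np.log(np.exp((q - m) / tau).sum())

# Example: smooth maximum of three action values.
print(smooth_value([1.0, 2.0, 3.0], tau=0.1))   # close to 3.0
```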
arXiv Detail & Related papers (2020-10-14T17:17:48Z)