Near-optimality for infinite-horizon restless bandits with many arms
- URL: http://arxiv.org/abs/2203.15853v1
- Date: Tue, 29 Mar 2022 18:49:21 GMT
- Title: Near-optimality for infinite-horizon restless bandits with many arms
- Authors: Xiangyu Zhang, Peter I. Frazier
- Abstract summary: Restless bandits are problems with applications in recommender systems, active learning, revenue management and other areas.
We derive a class of policies, called fluid-balance policies, that have a $O(\sqrt{N})$ optimality gap.
We also demonstrate empirically that fluid-balance policies provide state-of-the-art performance on specific problems.
- Score: 19.12759228067286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Restless bandits are an important class of problems with applications in
recommender systems, active learning, revenue management and other areas. We
consider infinite-horizon discounted restless bandits with many arms where a
fixed proportion of arms may be pulled in each period and where arms share a
finite state space. Although an average-case-optimal policy can be computed via
stochastic dynamic programming, the computation required grows exponentially
with the number of arms $N$. Thus, it is important to find scalable policies
that can be computed efficiently for large $N$ and that are near optimal in
this regime, in the sense that the optimality gap (i.e. the loss of expected
performance against an optimal policy) per arm vanishes for large $N$. However,
the most popular approach, the Whittle index, requires a hard-to-verify
indexability condition to be well-defined and another hard-to-verify condition
to guarantee a $o(N)$ optimality gap. We present a method resolving these
difficulties. By replacing a global Lagrange multiplier used by the Whittle
index with a sequence of Lagrangian multipliers, one per time period up to a
finite truncation point, we derive a class of policies, called fluid-balance
policies, that have a $O(\sqrt{N})$ optimality gap. Unlike the Whittle index,
fluid-balance policies do not require indexability to be well-defined and their
$O(\sqrt{N})$ optimality gap bound holds universally without sufficient
conditions. We also demonstrate empirically that fluid-balance policies provide
state-of-the-art performance on specific problems.
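To make the Lagrangian construction concrete, the sketch below (Python, not from the paper) solves the decoupled single-arm subproblem that arises once the per-period budget constraint is relaxed with one multiplier per period up to a truncation point. All names, the placeholder multiplier values, and the random problem instance are illustrative assumptions; the actual fluid-balance policies also choose the multipliers and re-couple the $N$ arms through the shared budget, which this sketch does not attempt.

```python
# Minimal sketch (assumptions, not the paper's implementation): backward
# induction for one arm of a discounted restless bandit after the per-period
# budget constraint has been relaxed with multipliers lambdas[0..T-1].
import numpy as np


def solve_relaxed_single_arm(P, r, lambdas, gamma):
    """Greedy policy for the relaxed single-arm subproblem.

    P[a]    : (S, S) transition matrix under action a in {0: idle, 1: pull}
    r[a]    : (S,) reward vector under action a
    lambdas : length-T sequence of per-period multipliers charged for pulling
    gamma   : discount factor in (0, 1)
    """
    S = r[0].shape[0]
    T = len(lambdas)
    V = np.zeros(S)                        # value beyond the truncation point
    policy = np.zeros((T, S), dtype=int)
    for t in reversed(range(T)):
        q_idle = r[0] + gamma * P[0] @ V
        q_pull = r[1] - lambdas[t] + gamma * P[1] @ V   # pulling pays a penalty
        policy[t] = (q_pull > q_idle).astype(int)
        V = np.maximum(q_idle, q_pull)
    return policy, V


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = 4
    P = [rng.dirichlet(np.ones(S), size=S) for _ in range(2)]  # random dynamics
    r = [np.zeros(S), rng.uniform(size=S)]                     # reward only when pulled
    lambdas = np.full(10, 0.3)   # placeholder multipliers; the paper tunes one per period
    policy, V = solve_relaxed_single_arm(P, r, lambdas, gamma=0.9)
    print(policy)
```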
Related papers
- Achieving Exponential Asymptotic Optimality in Average-Reward Restless Bandits without Global Attractor Assumption [11.41663079285674]
We propose a novel two-set policy that maintains two dynamic subsets of arms.
We show that our two-set policy is asymptotically optimal with an $O(\exp(-CN))$ optimality gap for an $N$-armed problem.
arXiv Detail & Related papers (2024-05-28T07:08:29Z) - Unichain and Aperiodicity are Sufficient for Asymptotic Optimality of Average-Reward Restless Bandits [11.41663079285674]
We show that our policies are asymptotically optimal with an $O(1/\sqrt{N})$ optimality gap for an $N$-armed problem.
Our approach departs from most existing work that focuses on index or priority policies.
arXiv Detail & Related papers (2024-02-08T14:07:20Z) - Thompson Exploration with Best Challenger Rule in Best Arm
Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z) - Restless Bandits with Average Reward: Breaking the Uniform Global
Attractor Assumption [12.471848976031904]
A fundamental goal is to efficiently compute policies that achieve a diminishing optimality gap as the number of arms, $N$, grows large.
Existing results on optimality all rely on the uniform global attractor property (UGAP), a complex and challenging-to-verify assumption.
We propose a general, simulation-based framework that converts any single-armed policy into a policy for the original $N$-armed problem.
arXiv Detail & Related papers (2023-05-31T21:26:43Z) - Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage [100.8180383245813]
We propose value-based algorithms for offline reinforcement learning (RL).
We show an analogous result for vanilla Q-functions under a soft margin condition.
Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
arXiv Detail & Related papers (2023-02-05T14:22:41Z) - Indexability is Not Enough for Whittle: Improved, Near-Optimal
Algorithms for Restless Bandits [30.532795983761314]
We study the problem of planning restless multi-armed bandits (RMABs) with multiple actions.
We first show that Whittle index policies can fail in simple and practically relevant settings.
We then propose an alternate planning algorithm based on the mean-field method.
arXiv Detail & Related papers (2022-10-31T19:35:15Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Restless Bandits with Many Arms: Beating the Central Limit Theorem [25.639496138046546]
Finite-horizon restless bandits with multiple pulls per period play an important role in recommender systems, active learning, revenue management, and many other areas.
While an optimal policy can be computed, in principle, using dynamic programming, the computation required scales exponentially in the number of arms $N$.
We characterize a non-degeneracy condition and a class of novel practically-computable policies, called fluid-priority policies, in which the optimality gap is $O(1)$.
arXiv Detail & Related papers (2021-07-25T23:27:12Z) - Minimax Optimization with Smooth Algorithmic Adversaries [59.47122537182611]
We propose a new algorithm for the min-player against smooth algorithms deployed by an adversary.
Our algorithm is guaranteed to make monotonic progress (having no limit cycles) and to find an appropriate number of gradient ascents.
arXiv Detail & Related papers (2021-06-02T22:03:36Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)