Near-optimality for infinite-horizon restless bandits with many arms
- URL: http://arxiv.org/abs/2203.15853v1
- Date: Tue, 29 Mar 2022 18:49:21 GMT
- Title: Near-optimality for infinite-horizon restless bandits with many arms
- Authors: Xiangyu Zhang, Peter I. Frazier
- Abstract summary: Restless bandits are problems with applications in recommender systems, active learning, revenue management and other areas.
We derive a class of policies, called fluid-balance policies, that have a $O(\sqrt{N})$ optimality gap.
We also demonstrate empirically that fluid-balance policies provide state-of-the-art performance on specific problems.
- Score: 19.12759228067286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Restless bandits are an important class of problems with applications in
recommender systems, active learning, revenue management and other areas. We
consider infinite-horizon discounted restless bandits with many arms where a
fixed proportion of arms may be pulled in each period and where arms share a
finite state space. Although an average-case-optimal policy can be computed via
stochastic dynamic programming, the computation required grows exponentially
with the number of arms $N$. Thus, it is important to find scalable policies
that can be computed efficiently for large $N$ and that are near optimal in
this regime, in the sense that the optimality gap (i.e. the loss of expected
performance against an optimal policy) per arm vanishes for large $N$. However,
the most popular approach, the Whittle index, requires a hard-to-verify
indexability condition to be well-defined and another hard-to-verify condition
to guarantee a $o(N)$ optimality gap. We present a method resolving these
difficulties. By replacing a global Lagrange multiplier used by the Whittle
index with a sequence of Lagrangian multipliers, one per time period up to a
finite truncation point, we derive a class of policies, called fluid-balance
policies, that have a $O(\sqrt{N})$ optimality gap. Unlike the Whittle index,
fluid-balance policies do not require indexability to be well-defined and their
$O(\sqrt{N})$ optimality gap bound holds universally without sufficient
conditions. We also demonstrate empirically that fluid-balance policies provide
state-of-the-art performance on specific problems.
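To make the Lagrangian construction concrete, the sketch below (Python, not from the paper) solves the decoupled single-arm subproblem that arises once the per-period budget constraint is relaxed with one multiplier per period up to a truncation point. All names, the placeholder multiplier values, and the random problem instance are illustrative assumptions; the actual fluid-balance policies also choose the multipliers and re-couple the $N$ arms through the shared budget, which this sketch does not attempt.

```python
# Minimal sketch (assumptions, not the paper's implementation): backward
# induction for one arm of a discounted restless bandit after the per-period
# budget constraint has been relaxed with multipliers lambdas[0..T-1].
import numpy as np


def solve_relaxed_single_arm(P, r, lambdas, gamma):
    """Greedy policy for the relaxed single-arm subproblem.

    P[a]    : (S, S) transition matrix under action a in {0: idle, 1: pull}
    r[a]    : (S,) reward vector under action a
    lambdas : length-T sequence of per-period multipliers charged for pulling
    gamma   : discount factor in (0, 1)
    """
    S = r[0].shape[0]
    T = len(lambdas)
    V = np.zeros(S)                        # value beyond the truncation point
    policy = np.zeros((T, S), dtype=int)
    for t in reversed(range(T)):
        q_idle = r[0] + gamma * P[0] @ V
        q_pull = r[1] - lambdas[t] + gamma * P[1] @ V   # pulling pays a penalty
        policy[t] = (q_pull > q_idle).astype(int)
        V = np.maximum(q_idle, q_pull)
    return policy, V


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = 4
    P = [rng.dirichlet(np.ones(S), size=S) for _ in range(2)]  # random dynamics
    r = [np.zeros(S), rng.uniform(size=S)]                     # reward only when pulled
    lambdas = np.full(10, 0.3)   # placeholder multipliers; the paper tunes one per period
    policy, V = solve_relaxed_single_arm(P, r, lambdas, gamma=0.9)
    print(policy)
```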
Related papers
- Achieving Exponential Asymptotic Optimality in Average-Reward Restless Bandits without Global Attractor Assumption [11.41663079285674]
We propose a novel two-set policy that maintains two dynamic subsets of arms.
We show that our two-set policy is asymptotically optimal with an $O(\exp(-CN))$ optimality gap for an $N$-armed problem.
arXiv Detail & Related papers (2024-05-28T07:08:29Z) - Unichain and Aperiodicity are Sufficient for Asymptotic Optimality of Average-Reward Restless Bandits [11.41663079285674]
We show that our policies are asymptotically optimal with an $O(1/\sqrt{N})$ optimality gap for an $N$-armed problem.
Our approach departs from most existing work that focuses on index or priority policies.
arXiv Detail & Related papers (2024-02-08T14:07:20Z) - Thompson Exploration with Best Challenger Rule in Best Arm
Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z) - Restless Bandits with Average Reward: Breaking the Uniform Global
Attractor Assumption [12.471848976031904]
A fundamental goal is to efficiently compute policies that achieve a diminishing optimality gap as the number of arms, $N$, grows large.
Existing results on optimality all rely on the uniform global attractor property (UGAP), a complex and challenging-to-verify assumption.
We propose a general, simulation-based framework that converts any single-armed policy into a policy for the original $N$-armed problem.
arXiv Detail & Related papers (2023-05-31T21:26:43Z) - Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage [100.8180383245813]
We propose value-based algorithms for offline reinforcement learning (RL).
We show an analogous result for vanilla Q-functions under a soft margin condition.
Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
arXiv Detail & Related papers (2023-02-05T14:22:41Z) - Indexability is Not Enough for Whittle: Improved, Near-Optimal
Algorithms for Restless Bandits [30.532795983761314]
We study the problem of planning restless multi-armed bandits (RMABs) with multiple actions.
We first show that Whittle index policies can fail in simple and practically relevant settings.
We then propose an alternate planning algorithm based on the mean-field method.
arXiv Detail & Related papers (2022-10-31T19:35:15Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Restless Bandits with Many Arms: Beating the Central Limit Theorem [25.639496138046546]
Finite-horizon restless bandits with multiple pulls per period play an important role in recommender systems, active learning, revenue management, and many other areas.
While an optimal policy can be computed, in principle, using dynamic programming, the computation required scales exponentially in the number of arms $N$.
We characterize a non-degeneracy condition and a class of novel practically-computable policies, called fluid-priority policies, in which the optimality gap is $O(1)$.
arXiv Detail & Related papers (2021-07-25T23:27:12Z) - Minimax Optimization with Smooth Algorithmic Adversaries [59.47122537182611]
We propose a new algorithm for the min-player against smooth algorithms deployed by an adversary.
Our algorithm is guaranteed to make monotonic progress (having no limit cycles) and to find an appropriate number of gradient ascents.
arXiv Detail & Related papers (2021-06-02T22:03:36Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)