Unichain and Aperiodicity are Sufficient for Asymptotic Optimality of Average-Reward Restless Bandits
- URL: http://arxiv.org/abs/2402.05689v3
- Date: Thu, 03 Oct 2024 17:37:33 GMT
- Title: Unichain and Aperiodicity are Sufficient for Asymptotic Optimality of Average-Reward Restless Bandits
- Authors: Yige Hong, Qiaomin Xie, Yudong Chen, Weina Wang
- Abstract summary: We show that our policies are asymptotically optimal with an $O(1/\sqrt{N})$ optimality gap for an $N$-armed problem.
Our approach departs from most existing work that focuses on index or priority policies.
- Score: 11.41663079285674
- Abstract: We consider the infinite-horizon, average-reward restless bandit problem in discrete time. We propose a new class of policies that are designed to drive a progressively larger subset of arms toward the optimal distribution. We show that our policies are asymptotically optimal with an $O(1/\sqrt{N})$ optimality gap for an $N$-armed problem, assuming only unichain and aperiodicity. Our approach departs from most existing work that focuses on index or priority policies, which rely on the Global Attractor Property (GAP) to guarantee convergence to the optimum, or a recently developed simulation-based policy, which requires a Synchronization Assumption (SA).
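The "optimal distribution" referenced above is the optimal state-action occupancy measure of the standard single-armed LP relaxation of the average-reward restless bandit. The sketch below is only an illustration of that relaxation on a hypothetical two-state, two-action arm (the transition matrices, rewards, and budget `alpha` are made up); it is not the paper's policy, but it produces the kind of target distribution $\mu^*(s,a)$ that such policies aim to drive arms toward.

```python
# Illustrative sketch only (not the paper's algorithm): the standard single-armed
# LP relaxation of an average-reward restless bandit. Its optimizer mu*(s, a) is
# the "optimal distribution" that the policies in the abstract drive arms toward.
# All numbers below (P, r, alpha) are hypothetical.
import numpy as np
from scipy.optimize import linprog

S, A = 2, 2                  # states {0, 1}; actions {0 = passive, 1 = active}
alpha = 0.4                  # budget: long-run fraction of arms that may be active
P = [np.array([[0.9, 0.1],   # P[a][s, s']: transition probabilities under action a
               [0.3, 0.7]]),
     np.array([[0.4, 0.6],
               [0.1, 0.9]])]
r = np.array([[0.0, 0.1],    # r[s, a]: one-step rewards
              [0.5, 1.0]])

# Variables: mu[s, a], flattened as (s=0,a=0), (s=0,a=1), (s=1,a=0), (s=1,a=1).
# Maximize sum_{s,a} r[s,a] * mu[s,a]; linprog minimizes, so negate the objective.
c = -r.flatten()

A_eq, b_eq = [], []
# Flow balance: for every s', sum_a mu[s',a] = sum_{s,a} mu[s,a] * P[a][s,s'].
for s_next in range(S):
    row = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row[s * A + a] -= P[a][s, s_next]
            if s == s_next:
                row[s * A + a] += 1.0
    A_eq.append(row)
    b_eq.append(0.0)
# Normalization: mu is a probability distribution over (s, a).
A_eq.append(np.ones(S * A))
b_eq.append(1.0)
# Relaxed budget: the long-run fraction of time the arm is active equals alpha.
row = np.zeros(S * A)
row[1::A] = 1.0
A_eq.append(row)
b_eq.append(alpha)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=[(0, 1)] * (S * A))
mu_star = res.x.reshape(S, A)
print("target distribution mu*(s, a):\n", mu_star)
print("LP upper bound on per-arm average reward:", -res.fun)
```

A policy of the kind described in the abstract would then, at each step and subject to the budget of roughly $\alpha N$ activations, steer a progressively larger subset of the $N$ arms so that their empirical state-action frequencies track $\mu^*$.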
Related papers
- Achieving Exponential Asymptotic Optimality in Average-Reward Restless Bandits without Global Attractor Assumption [11.41663079285674]
We propose a novel *two-set policy* that maintains two dynamic subsets of arms.
We show that our two-set policy is asymptotically optimal with an $O(\exp(-CN))$ optimality gap for an $N$-armed problem.
arXiv Detail & Related papers (2024-05-28T07:08:29Z) - Thompson Exploration with Best Challenger Rule in Best Arm Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z) - Restless Bandits with Average Reward: Breaking the Uniform Global Attractor Assumption [12.471848976031904]
A fundamental goal is to efficiently compute policies that achieve a diminishing optimality gap as the number of arms, $N$, grows large.
Existing results on optimality all rely on the uniform global attractor property (UGAP), a complex and challenging-to-verify assumption.
We propose a general, simulation-based framework that converts any single-armed policy into a policy for the original $N$-armed problem.
arXiv Detail & Related papers (2023-05-31T21:26:43Z) - Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$.
arXiv Detail & Related papers (2022-12-12T21:37:36Z) - Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games [63.60117916422867]
This paper focuses on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games.
We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method.
Our convergence results improve upon the best known complexities, and lead to a better understanding of policy optimization in competitive Markov games.
arXiv Detail & Related papers (2022-10-03T16:05:43Z) - Near-optimality for infinite-horizon restless bandits with many arms [19.12759228067286]
Restless bandits are problems with applications in recommender systems, active learning, revenue management and other areas.
We derive a class of policies, called fluid-balance policies, that have an $O(\sqrt{N})$ optimality gap.
We also demonstrate empirically that fluid-balance policies provide state-of-the-art performance on specific problems.
arXiv Detail & Related papers (2022-03-29T18:49:21Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Restless Bandits with Many Arms: Beating the Central Limit Theorem [25.639496138046546]
Finite-horizon restless bandits with multiple pulls per period play an important role in recommender systems, active learning, revenue management, and many other areas.
While an optimal policy can be computed, in principle, using dynamic programming, the computation required scales exponentially in the number of arms $N$.
We characterize a non-degeneracy condition and a class of novel, practically computable policies, called fluid-priority policies, whose optimality gap is $O(1)$.
arXiv Detail & Related papers (2021-07-25T23:27:12Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with guaranteed convergence to globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z) - An Asymptotically Optimal Primal-Dual Incremental Algorithm for Contextual Linear Bandits [129.1029690825929]
We introduce a novel algorithm improving over the state-of-the-art along multiple dimensions.
We establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits.
arXiv Detail & Related papers (2020-10-23T09:12:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.