Restless Bandits with Average Reward: Breaking the Uniform Global
Attractor Assumption
- URL: http://arxiv.org/abs/2306.00196v3
- Date: Tue, 16 Jan 2024 05:42:06 GMT
- Title: Restless Bandits with Average Reward: Breaking the Uniform Global
Attractor Assumption
- Authors: Yige Hong, Qiaomin Xie, Yudong Chen, Weina Wang
- Abstract summary: A fundamental goal is to efficiently compute policies that achieve a diminishing optimality gap as the number of arms, $N$, grows large.
Existing results on asymptotic optimality all rely on the uniform global attractor property (UGAP), a complex and challenging-to-verify assumption.
We propose a general, simulation-based framework that converts any single-armed policy into a policy for the original $N$-armed problem.
- Score: 12.471848976031904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the infinite-horizon restless bandit problem with the average reward
criterion, in both discrete-time and continuous-time settings. A fundamental
goal is to efficiently compute policies that achieve a diminishing optimality
gap as the number of arms, $N$, grows large. Existing results on asymptotic
optimality all rely on the uniform global attractor property (UGAP), a complex
and challenging-to-verify assumption. In this paper, we propose a general,
simulation-based framework, Follow-the-Virtual-Advice, that converts any
single-armed policy into a policy for the original $N$-armed problem. This is
done by simulating the single-armed policy on each arm and carefully steering
the real state towards the simulated state. Our framework can be instantiated
to produce a policy with an $O(1/\sqrt{N})$ optimality gap. In the
discrete-time setting, our result holds under a simpler synchronization
assumption, which covers some problem instances that violate UGAP. More
notably, in the continuous-time setting, we do not require \emph{any}
additional assumptions beyond the standard unichain condition. In both
settings, our work is the first asymptotic optimality result that does not
require UGAP.
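To make the simulate-then-steer idea in the abstract concrete, below is a minimal sketch, assuming a finite per-arm state space, an activation budget of roughly $\alpha N$ arms per step, a vectorized single-armed policy `pi_bar`, and a shared two-action transition kernel `P`. The function name, data layout, and tie-breaking rule are illustrative assumptions, not the paper's actual construction or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def follow_the_virtual_advice_sketch(pi_bar, P, N, alpha, T, n_states):
    """Hedged sketch of the idea described in the abstract: simulate the
    single-armed policy on a virtual copy of each arm, then steer the real
    arms toward their virtual counterparts under the activation budget.

    pi_bar(states) -> per-arm activation probabilities (vectorized over states)
    P              -> array of shape (2, n_states, n_states); P[a][s] is the
                      next-state distribution under action a in {0 (passive), 1 (active)}
    """
    real = rng.integers(n_states, size=N)   # real arm states
    virtual = real.copy()                   # simulated ("virtual") arm states
    budget = int(alpha * N)                 # arms activated per time step

    for _ in range(T):
        # 1. Virtual system: every virtual arm follows the single-armed policy.
        virtual_actions = (rng.random(N) < pi_bar(virtual)).astype(int)

        # 2. Real system: copy the virtual action on arms whose real state
        #    already agrees with the virtual state, then fill the remaining
        #    budget by index (a crude stand-in for the paper's steering rule).
        agree = (real == virtual)
        want_active = np.flatnonzero(virtual_actions == 1)
        preferred = want_active[agree[want_active]]
        others = np.setdiff1d(np.arange(N), preferred)
        chosen = np.concatenate([preferred, others])[:budget]
        real_actions = np.zeros(N, dtype=int)
        real_actions[chosen] = 1

        # 3. Both systems evolve under the same per-arm Markov kernel.
        for i in range(N):
            virtual[i] = rng.choice(n_states, p=P[virtual_actions[i]][virtual[i]])
            real[i] = rng.choice(n_states, p=P[real_actions[i]][real[i]])
    return real, virtual
```

The only design point the sketch tries to convey is the two-system structure: the virtual arms run the single-armed policy unconstrained, while the real arms spend the hard budget on actions that keep them coupled to the virtual trajectory.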
Related papers
- Model Predictive Control is Almost Optimal for Restless Bandit [2.295863158976069]
We consider the discrete-time infinite-horizon average-reward restless Markovian bandit (RMAB) problem.
We propose a \emph{model predictive control} based non-stationary policy with a rolling computational horizon $\tau$.
We show that its sub-optimality gap is $O(1/\sqrt{N})$ in general, and $\exp(-\Omega(N))$ under a local-stability condition.
arXiv Detail & Related papers (2024-10-08T19:34:23Z) - Achieving Exponential Asymptotic Optimality in Average-Reward Restless Bandits without Global Attractor Assumption [11.41663079285674]
We propose a novel \emph{two-set} policy that maintains two dynamic subsets of arms.
We show that our two-set policy is asymptotically optimal, with an $O(\exp(-C N))$ optimality gap for an $N$-armed problem.
arXiv Detail & Related papers (2024-05-28T07:08:29Z) - Unichain and Aperiodicity are Sufficient for Asymptotic Optimality of Average-Reward Restless Bandits [11.41663079285674]
We show that our policies are asymptotically optimal, with an $O(1/\sqrt{N})$ optimality gap for an $N$-armed problem.
Our approach departs from most existing work that focuses on index or priority policies.
arXiv Detail & Related papers (2024-02-08T14:07:20Z) - Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z) - Thompson Exploration with Best Challenger Rule in Best Arm
Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z) - Estimating Optimal Policy Value in General Linear Contextual Bandits [50.008542459050155]
In many bandit problems, the maximal reward achievable by a policy is often unknown in advance.
We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable.
We present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$.
arXiv Detail & Related papers (2023-02-19T01:09:24Z) - Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$.
arXiv Detail & Related papers (2022-12-12T21:37:36Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Restless Bandits with Many Arms: Beating the Central Limit Theorem [25.639496138046546]
Finite-horizon restless bandits with multiple pulls per period play an important role in recommender systems, active learning, revenue management, and many other areas.
While an optimal policy can be computed, in principle, using dynamic programming, the computation required scales exponentially in the number of arms $N$.
We characterize a non-degeneracy condition and a class of novel practically-computable policies, called fluid-priority policies, in which the optimality gap is $O(1)$.
arXiv Detail & Related papers (2021-07-25T23:27:12Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.