Model Predictive Control is almost Optimal for Heterogeneous Restless Multi-armed Bandits
- URL: http://arxiv.org/abs/2511.08097v1
- Date: Wed, 12 Nov 2025 01:39:37 GMT
- Title: Model Predictive Control is almost Optimal for Heterogeneous Restless Multi-armed Bandits
- Authors: Dheeraj Narasimha, Nicolas Gast
- Abstract summary: We show that a natural finite-horizon LP-update policy with randomized rounding achieves an $O(\log N\sqrt{1/N})$ optimality gap in infinite time average reward problems. Our results draw on techniques from the model predictive control literature by invoking the concept of \emph{dissipativity}.
- Score: 6.402634424631123
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We consider a general infinite horizon Heterogeneous Restless multi-armed Bandit (RMAB). Heterogeneity is a fundamental problem for many real-world systems largely because it resists many concentration arguments. In this paper, we assume that each of the $N$ arms can have different model parameters. We show that, under a mild assumption of uniform ergodicity, a natural finite-horizon LP-update policy with randomized rounding, which was originally proposed for the homogeneous case, achieves an $O(\log N\sqrt{1/N})$ optimality gap in infinite time average reward problems for fully heterogeneous RMABs. In doing so, we provide strong theoretical guarantees for a well-known algorithm that works very well in practice. The LP-update policy is a model predictive approach that computes a decision at time $t$ by planning over the time horizon $\{t,\dots,t+\tau\}$. Our simulation section demonstrates that our algorithm works extremely well even when $\tau$ is very small and set to $5$, which makes it computationally efficient. Our theoretical results draw on techniques from the model predictive control literature by invoking the concept of \emph{dissipativity}, and they generalize quite easily to the more general weakly coupled heterogeneous Markov Decision Process setting. In addition, we draw a parallel between our own policy and the LP-index policy by showing that the LP-index policy corresponds to $\tau=1$. We describe where the latter's shortcomings arise and how, under our mild assumption, we are able to address them. The proof of our main theorem answers an open problem posed by Brown et al. (2020), paving the way for several new questions on LP-update policies.
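Since the abstract describes the LP-update loop only in words, the following is a minimal, hedged Python sketch of the idea: at each time step it solves a $\tau$-horizon occupation-measure LP relaxation from the current joint state, reads off the first-step activation marginals, and rounds them into a feasible action before re-solving at the next step. The synthetic model, the budget of exactly $\alpha N$ activations per step, and the simplified sampling-based rounding are assumptions of this sketch, not the paper's exact construction.

```python
# A minimal sketch of the rolling-horizon LP-update policy described above.
# Everything here is illustrative: the synthetic model (P, R), the hard
# activation budget, and the simplified rounding step are assumptions
# made for the sketch, not the paper's exact construction.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

N, S, A = 4, 3, 2        # arms, states per arm, actions (0 = rest, 1 = activate)
tau = 5                  # short rolling horizon, as in the paper's experiments
budget = 2               # alpha*N = 2 arms activated per step (alpha = 0.5)

# Heterogeneous model: every arm has its own transition kernel and rewards.
P = rng.dirichlet(np.ones(S), size=(N, A, S))   # P[n, a, s] = next-state distribution
R = rng.random((N, A, S))                       # R[n, a, s] = one-step reward

def lp_update_marginals(states):
    """Solve the tau-horizon occupation-measure LP relaxation from the current
    joint state; return each arm's first-step activation marginal."""
    idx = lambda n, h, s, a: ((n * tau + h) * S + s) * A + a   # flatten y[n,h,s,a]
    nvar = N * tau * S * A
    c = np.zeros(nvar)                           # linprog minimizes, so negate rewards
    for n in range(N):
        for h in range(tau):
            for s in range(S):
                for a in range(A):
                    c[idx(n, h, s, a)] = -R[n, a, s]
    A_eq, b_eq = [], []
    for n in range(N):                           # h = 0: mass sits on the current state
        for s in range(S):
            row = np.zeros(nvar)
            row[[idx(n, 0, s, a) for a in range(A)]] = 1.0
            A_eq.append(row); b_eq.append(1.0 if s == states[n] else 0.0)
    for n in range(N):                           # flow conservation between steps
        for h in range(1, tau):
            for s in range(S):
                row = np.zeros(nvar)
                row[[idx(n, h, s, a) for a in range(A)]] = 1.0
                for sp in range(S):
                    for a in range(A):
                        row[idx(n, h - 1, sp, a)] -= P[n, a, sp, s]
                A_eq.append(row); b_eq.append(0.0)
    for h in range(tau):                         # relaxed budget at every step
        row = np.zeros(nvar)
        for n in range(N):
            row[[idx(n, h, s, 1) for s in range(S)]] = 1.0
        A_eq.append(row); b_eq.append(float(budget))
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, 1), method="highs")
    assert res.success
    return np.array([sum(res.x[idx(n, 0, s, 1)] for s in range(S)) for n in range(N)])

# Model predictive control loop: re-solve the LP at every step, round, act.
states = rng.integers(0, S, size=N)
for t in range(10):
    probs = lp_update_marginals(states)
    # Simplified rounding: sample exactly `budget` distinct arms with probability
    # proportional to the LP marginals (the paper's randomized rounding is more careful).
    p = probs + 1e-9
    active = rng.choice(N, size=budget, replace=False, p=p / p.sum())
    actions = np.zeros(N, dtype=int); actions[active] = 1
    states = np.array([rng.choice(S, p=P[n, actions[n], states[n]]) for n in range(N)])
```

Re-solving the LP at every step is what makes this a model predictive controller; per the abstract, even a horizon as short as $\tau = 5$ already works well in practice while keeping the per-step computation cheap.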
Related papers
- Provably Efficient Algorithms for S- and Non-Rectangular Robust MDPs with General Parameterization [85.91302339486673]
We study robust Markov decision processes (RMDPs) with general policy parameterization under s-rectangular and non-rectangular uncertainty sets. We prove novel Lipschitz and Lipschitz-smoothness properties for general policy parameterizations that extend to infinite state spaces. We design a projected gradient descent algorithm for s-rectangular uncertainty and a Frank-Wolfe algorithm for non-rectangular uncertainty.
arXiv Detail & Related papers (2026-02-11T21:44:20Z) - Revisiting Weighted Strategy for Non-stationary Parametric Bandits and MDPs [56.246783503873225]
This paper revisits the weighted strategy for non-stationary parametric bandits. We propose a simpler weight-based algorithm that is as efficient as window/restart-based algorithms. Our framework can be used to improve regret bounds of other parametric bandits.
arXiv Detail & Related papers (2026-01-03T04:50:21Z) - Achieving $\tilde{\mathcal{O}}(1/N)$ Optimality Gap in Restless Bandits through Gaussian Approximation [21.34216861973257]
We study the finite-horizon Restless Multi-Armed Bandit (RMAB) problem with $N$ homogeneous arms. Our approach is based on the construction of a Gaussian system that captures not only the mean but also the variance of the RMAB dynamics. This is the first result to establish an $\tilde{\mathcal{O}}(1/N)$ optimality gap for degenerate RMABs.
arXiv Detail & Related papers (2024-10-19T06:29:18Z) - Model Predictive Control is Almost Optimal for Restless Bandit [2.295863158976069]
We consider the discrete time infinite horizon average reward restless Markovian bandit (RMAB) problem. We propose a \emph{model predictive control} based non-stationary policy with a rolling computational horizon $\tau$. We show that its sub-optimality gap is $O(1/\sqrt{N})$ in general, and $\exp(-\Omega(N))$ under a local-stability condition.
arXiv Detail & Related papers (2024-10-08T19:34:23Z) - Thompson Exploration with Best Challenger Rule in Best Arm Identification [59.02170783023547]
We study the fixed-confidence best arm identification problem in the bandit framework. We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z) - Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs [17.62509045102346]
This paper considers the best policy identification problem in online Constrained Markov Decision Processes (CMDPs).
We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with a high probability.
Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy.
arXiv Detail & Related papers (2023-09-27T04:33:09Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost. We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - Restless Bandits with Average Reward: Breaking the Uniform Global Attractor Assumption [12.471848976031904]
A fundamental goal is to efficiently compute policies that achieve a diminishing optimality gap as the number of arms, $N$, grows large.
Existing results on optimality all rely on the uniform global attractor property (UGAP), a complex and challenging-to-verify assumption.
We propose a general, simulation-based framework that converts any single-armed policy into a policy for the original $N$-armed problem.
arXiv Detail & Related papers (2023-05-31T21:26:43Z) - Estimating Optimal Policy Value in General Linear Contextual Bandits [50.008542459050155]
In many bandit problems, the maximal reward achievable by a policy is often unknown in advance.
We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable.
We present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$.
arXiv Detail & Related papers (2023-02-19T01:09:24Z) - Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$.
arXiv Detail & Related papers (2022-12-12T21:37:36Z)