Non-Stationary Restless Multi-Armed Bandits with Provable Guarantee
- URL: http://arxiv.org/abs/2508.10804v1
- Date: Thu, 14 Aug 2025 16:26:00 GMT
- Title: Non-Stationary Restless Multi-Armed Bandits with Provable Guarantee
- Authors: Yu-Heng Hung, Ping-Chun Hsieh, Kai Wang
- Abstract summary: Online restless multi-armed bandits (RMABs) assume that each arm follows a stationary Markov Decision Process (MDP) with fixed state transitions and rewards.
In real-world applications like healthcare and recommendation systems, these assumptions often break due to non-stationary dynamics.
Our proposed algorithm integrates sliding window reinforcement learning (RL) with an upper confidence bound (UCB) mechanism to simultaneously learn transition dynamics and their variations.
- Score: 14.201646000111868
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online restless multi-armed bandits (RMABs) typically assume that each arm follows a stationary Markov Decision Process (MDP) with fixed state transitions and rewards. However, in real-world applications like healthcare and recommendation systems, these assumptions often break due to non-stationary dynamics, posing significant challenges for traditional RMAB algorithms. In this work, we specifically consider $N$-armed RMABs with non-stationary transitions constrained by a bounded variation budget $B$. Our proposed algorithm integrates sliding window reinforcement learning (RL) with an upper confidence bound (UCB) mechanism to simultaneously learn transition dynamics and their variations. We further establish that our algorithm achieves an $\widetilde{\mathcal{O}}(N^2 B^{\frac{1}{4}} T^{\frac{3}{4}})$ regret bound by leveraging a relaxed definition of regret, providing a foundational theoretical framework for non-stationary RMAB problems for the first time.
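The abstract's core idea, restricting estimation to a sliding window of recent observations and attaching a UCB-style confidence radius, can be sketched as follows. This is a minimal illustration assuming tabular states and binary actions; the function name, interface, and Hoeffding-style radius are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def sliding_window_ucb_estimate(transitions, window, n_states, delta=0.05):
    """Estimate P(s' | s, a) from only the last `window` observed transitions,
    with a confidence radius that shrinks as in-window counts grow.

    `transitions` is a sequence of (s, a, s_next) tuples; this interface is
    a hypothetical sketch, not the paper's.
    """
    recent = list(transitions)[-window:]          # discard stale data
    counts = np.zeros((n_states, 2, n_states))    # binary action space assumed
    for s, a, s_next in recent:
        counts[s, a, s_next] += 1
    visits = counts.sum(axis=2, keepdims=True)
    # Empirical transition probabilities; uniform prior where no data yet
    p_hat = np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / n_states)
    # Hoeffding-style confidence radius per (s, a) pair
    radius = np.sqrt(np.log(2.0 / delta) / (2.0 * np.maximum(visits, 1)))
    return p_hat, radius
```

Discarding observations outside the window is what lets the estimator track transition dynamics whose total variation over time is bounded by the budget $B$.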
Related papers
- Constrained Feedback Learning for Non-Stationary Multi-Armed Bandits [9.351444106520516]
Non-stationary multi-armed bandits enable agents to adapt to changing environments by incorporating mechanisms to detect and respond to shifts in reward distributions.
We introduce a new model of constrained feedback in non-stationary multi-armed bandits, where the availability of reward feedback is restricted.
We propose the first prior-free algorithm that achieves near-optimal dynamic regret in this setting.
arXiv Detail & Related papers (2025-09-18T15:35:32Z)
- From Theory to Practice with RAVEN-UCB: Addressing Non-Stationarity in Multi-Armed Bandits through Variance Adaptation [13.490692458295301]
RAVEN-UCB is a novel algorithm that combines theoretical rigor with practical efficiency via variance-aware adaptation.
It achieves tighter regret bounds than UCB1 and UCB-V, with gap-dependent regret of order $K \sigma_{\max}^2 \log T / \Delta$ and gap-independent regret of order $\sqrt{K T \log T}$.
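The variance-aware idea behind this line of work can be illustrated with a Bernstein-style index in the spirit of UCB-V: arms with low empirical variance get a smaller exploration bonus. This is a generic sketch, not RAVEN-UCB's exact rule, which the summary does not fully specify; the function name and the range bound `b` are assumptions.

```python
import math

def ucb_v_index(mean, var, n, t, b=1.0):
    """Variance-aware (Bernstein-style) UCB index, UCB-V flavor.

    mean, var -- empirical mean and variance of the arm's rewards
    n         -- number of pulls of this arm so far
    t         -- current round
    b         -- bound on the reward range (assumed known)
    """
    if n == 0:
        return float("inf")  # force at least one pull of every arm
    log_t = math.log(max(t, 2))
    # Bernstein bonus: variance term dominates for large n,
    # the range term 3*b*log(t)/n handles small-sample corrections
    return mean + math.sqrt(2.0 * var * log_t / n) + 3.0 * b * log_t / n
```

With equal means and pull counts, the lower-variance arm receives the smaller index, which is the mechanism behind the tighter variance-dependent bounds.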
arXiv Detail & Related papers (2025-06-03T14:35:04Z)
- Influential Bandits: Pulling an Arm May Change the Environment [44.71145269686588]
Real-world applications often involve non-stationary environments and interdependencies between arms.
We propose the influential bandit problem, which models inter-arm interactions through an unknown, symmetric, positive semi-definite interaction matrix.
We introduce a new algorithm based on a lower confidence bound (LCB) estimator tailored to the structure of the loss dynamics.
arXiv Detail & Related papers (2025-04-11T02:05:51Z)
- Continuous K-Max Bandits [54.21533414838677]
We study the $K$-Max multi-armed bandits problem with continuous outcome distributions and weak value-index feedback.
This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc.
Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds.
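The combination of discretization with a bias-corrected confidence bound can be sketched as a UCB rule over bins of a continuous action space, where each bin's index carries both the usual exploration bonus and an extra term `eps` covering the discretization error. This is a generic sketch of the idea only; DCK-UCB's actual correction and feedback model differ, and all names here are hypothetical.

```python
import math

def discretized_ucb(bin_means, bin_counts, t, eps):
    """Pick a bin of a discretized continuous arm space by UCB.

    bin_means  -- empirical mean outcome per bin
    bin_counts -- number of samples per bin
    t          -- current round
    eps        -- bound on within-bin (discretization) bias
    """
    best, best_idx = -float("inf"), 0
    for i, (m, n) in enumerate(zip(bin_means, bin_counts)):
        # unexplored bins get infinite index; otherwise add the
        # exploration bonus plus the discretization-bias correction
        idx = float("inf") if n == 0 else m + math.sqrt(2 * math.log(t) / n) + eps
        if idx > best:
            best, best_idx = idx, i
    return best_idx
```

Shrinking the bins (and hence `eps`) over time, adaptively where the payoff landscape demands it, trades discretization bias against per-bin sample efficiency.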
arXiv Detail & Related papers (2025-02-19T06:37:37Z)
- Combinatorial Multivariant Multi-Armed Bandits with Applications to Episodic Reinforcement Learning and Beyond [58.39457881271146]
We introduce a novel framework of combinatorial multi-armed bandits (CMAB) with multivariant and probabilistically triggering arms (CMAB-MT).
Compared with existing CMAB works, CMAB-MT not only enhances the modeling power but also allows improved results by leveraging distinct statistical properties for multivariant random variables.
Our framework can include many important problems as applications, such as episodic reinforcement learning (RL) and probabilistic maximum coverage for goods distribution.
arXiv Detail & Related papers (2024-06-03T14:48:53Z)
- Pure Exploration for Constrained Best Mixed Arm Identification with a Fixed Budget [6.22018632187078]
We introduce the constrained best mixed arm identification (CBMAI) problem with a fixed budget.
The goal is to find the best mixed arm that maximizes the expected reward subject to constraints on the expected costs with a given learning budget $N$.
We provide a theoretical upper bound on the mis-identification (of the support of the best mixed arm) probability and show that it decays exponentially in the budget $N$.
arXiv Detail & Related papers (2024-05-23T22:35:11Z)
- A Robustness Analysis of Blind Source Separation [91.3755431537592]
Blind source separation (BSS) aims to recover an unobserved signal from its mixture $X=f(S)$ under the condition that the transformation $f$ is invertible but unknown.
We present a general framework for analysing such violations and quantifying their impact on the blind recovery of $S$ from $X$.
We show that a generic BSS-solution in response to general deviations from its defining structural assumptions can be profitably analysed in the form of explicit continuity guarantees.
arXiv Detail & Related papers (2023-03-17T16:30:51Z)
- Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d \sqrt{\sum_{k=1}^{K} \sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z)
- Batch-Size Independent Regret Bounds for Combinatorial Semi-Bandits with Probabilistically Triggered Arms or Independent Arms [59.8188496313214]
We study combinatorial semi-bandits (CMAB) and focus on reducing the dependency on the batch size $K$ in the regret bound.
First, for the setting of CMAB with probabilistically triggered arms (CMAB-T), we propose a BCUCB-T algorithm with variance-aware confidence intervals.
Second, for the setting of non-triggering CMAB with independent arms, we propose a SESCB algorithm which leverages the non-triggering version of the TPVM condition.
arXiv Detail & Related papers (2022-08-31T13:09:39Z)
- Optimistic Whittle Index Policy: Online Learning for Restless Bandits [31.312043984489666]
We propose the first online learning algorithm based on the Whittle index policy to learn transition dynamics.
Our algorithm, UCWhittle, achieves sublinear $O(\sqrt{T \log T})$ frequentist regret to solve RMABs with unknown transitions.
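The Whittle index at the heart of this policy can be computed, for a single arm with known transitions, by bisection on the passive-action subsidy: the index of a state is the subsidy at which acting and resting become equally attractive. The sketch below uses discounted value iteration as the inner solver; it is a generic illustration, and UCWhittle additionally plans with optimistic transition estimates, which is omitted here.

```python
import numpy as np

def whittle_index(s, P_passive, P_active, reward, gamma=0.95,
                  lo=-1.0, hi=1.0, iters=30, vi_steps=200):
    """Whittle index of state `s` via bisection on the passive subsidy.

    For each candidate subsidy lam, solve the single-arm MDP (value
    iteration) where resting additionally pays lam, then check which
    action is preferred at `s`.
    """
    n = len(reward)
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        V = np.zeros(n)
        for _ in range(vi_steps):
            q_passive = reward + lam + gamma * P_passive @ V
            q_active = reward + gamma * P_active @ V
            V = np.maximum(q_passive, q_active)
        # if acting is still preferred at s, the subsidy is too small
        if (reward[s] + gamma * P_active[s] @ V) > \
           (reward[s] + lam + gamma * P_passive[s] @ V):
            lo = lam
        else:
            hi = lam
    return 0.5 * (lo + hi)
```

The Whittle index policy then simply activates, at each round, the arms whose current-state indices are largest.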
arXiv Detail & Related papers (2022-05-30T18:32:20Z)
- Robust Restless Bandits: Tackling Interval Uncertainty with Deep Reinforcement Learning [31.515757763077065]
We introduce Robust Restless Bandits, a generalization of restless multi-armed bandits (RMAB).
We develop solutions for a minimax regret objective when transitions are given by interval uncertainties.
We introduce RMABPPO, a novel deep reinforcement learning algorithm for solving RMABs.
arXiv Detail & Related papers (2021-07-04T17:21:26Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.