Related papers: From Theory to Practice with RAVEN-UCB: Addressing Non-Stationarity in Multi-Armed Bandits through Variance Adaptation

From Theory to Practice with RAVEN-UCB: Addressing Non-Stationarity in Multi-Armed Bandits through Variance Adaptation

URL: http://arxiv.org/abs/2506.02933v1
Date: Tue, 03 Jun 2025 14:35:04 GMT
Title: From Theory to Practice with RAVEN-UCB: Addressing Non-Stationarity in Multi-Armed Bandits through Variance Adaptation
Authors: Junyi Fang, Yuxun Chen, Yuxin Chen, Chen Zhang,
Abstract summary: RAVEN-UCB is a novel algorithm that combines theoretical rigor with practical efficiency via variance-aware adaptation.<n>It achieves tighter regret bounds than UCB1 and UCB-V, with gap-dependent regret of order $K sigma_max2 log T / Delta$ and gap-independent regret of order $sqrtK T log T$.
Score: 13.490692458295301
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The Multi-Armed Bandit (MAB) problem is challenging in non-stationary environments where reward distributions evolve dynamically. We introduce RAVEN-UCB, a novel algorithm that combines theoretical rigor with practical efficiency via variance-aware adaptation. It achieves tighter regret bounds than UCB1 and UCB-V, with gap-dependent regret of order $K \sigma_{\max}^2 \log T / \Delta$ and gap-independent regret of order $\sqrt{K T \log T}$. RAVEN-UCB incorporates three innovations: (1) variance-driven exploration using $\sqrt{\hat{\sigma}_k^2 / (N_k + 1)}$ in confidence bounds, (2) adaptive control via $\alpha_t = \alpha_0 / \log(t + \epsilon)$, and (3) constant-time recursive updates for efficiency. Experiments across non-stationary patterns - distributional changes, periodic shifts, and temporary fluctuations - in synthetic and logistics scenarios demonstrate its superiority over state-of-the-art baselines, confirming theoretical and practical robustness.

Related papers

Non-Stationary Restless Multi-Armed Bandits with Provable Guarantee [14.201646000111868]
Online restless multi-armed bandits (RMABs) assume that each arm follows a stationary Markov Decision Process (MDP) with fixed state transitions and rewards.<n>In real-world applications like healthcare and recommendation systems, these assumptions often break due to non-stationary dynamics.<n>Our proposed rmab; algorithm integrates sliding window reinforcement learning (RL) with an upper confidence bound (UCB) mechanism to simultaneously learn transition dynamics and their variations.
arXiv Detail & Related papers (2025-08-14T16:26:00Z)
LLMs are Bayesian, in Expectation, not in Realization [0.0]
Large language models adapt to new tasks without parameter updates.<n>Recent empirical findings reveal a fundamental contradiction: transformers systematically violate the martingale property.<n>This violation challenges the theoretical foundations underlying uncertainty quantification in critical applications.
arXiv Detail & Related papers (2025-07-15T22:20:11Z)
Non-stationary Online Learning for Curved Losses: Improved Dynamic Regret via Mixability [65.99855403424979]
We show that dynamic regret can be substantially improved by leveraging the concept of mixability.<n>We demonstrate that an exponential-weight method with fixed-share updates achieves an $mathcalO(d T2/3 P_T2/3 log T)$ dynamic regret for mixable losses.
arXiv Detail & Related papers (2025-06-12T12:00:08Z)
Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM) We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $tilde Obig(dsqrtsum_t=1Tsigma_t2 + dbig)$. Our regret bound naturally aligns with the intuitive expectation in scenarios where the comparison is deterministic, the algorithm only suffers from an $tilde O(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
Sample Complexity of Variance-reduced Distributionally Robust Q-learning [17.96094201655567]
This paper presents two novel model-free algorithms, namely the distributionally robust Q-learning and its variance-reduced counterpart, that can effectively learn a robust policy despite distributional shifts. A series of numerical experiments confirm the theoretical findings and the efficiency of the algorithms in handling distributional shifts.
arXiv Detail & Related papers (2023-05-28T19:40:46Z)
Stability-penalty-adaptive follow-the-regularized-leader: Sparsity, game-dependency, and best-of-both-worlds [46.30750729936261]
Follow-the-regularized-leader (FTRL) has recently emerged as one of the most promising approaches for obtaining various types of adaptivity in bandit problems. We establish several algorithms with three types of adaptivity: sparsity, game-dependency, and best-of-both-worlds (BOBW)
arXiv Detail & Related papers (2023-05-26T23:20:48Z)
Contextual Combinatorial Bandits with Probabilistically Triggered Arms [55.9237004478033]
We study contextual bandits with probabilistically triggered arms (C$2$MAB-T) under a variety of smoothness conditions. Under the triggering modulated (TPM) condition, we devise the C$2$-UC-T algorithm and derive a regret bound $tildeO(dsqrtT)$.
arXiv Detail & Related papers (2023-03-30T02:51:00Z)
Adversarially Robust Multi-Armed Bandit Algorithm with Variance-Dependent Regret Bounds [34.37963000493442]
This paper considers the multi-armed bandit (MAB) problem and provides a new best-of-both-worlds (BOBW) algorithm that works nearly optimally in both adversarial settings.
arXiv Detail & Related papers (2022-06-14T12:58:46Z)
Rate-matching the regret lower-bound in the linear quadratic regulator with unknown dynamics [6.287145010885044]
This paper establishes a novel regret upper-bound of $O_p(sqrtT,textpolylog(T))$. It simultaneously establishes an estimation error bound on the dynamics of $O_p(sqrtT,textpolylog(T))$.
arXiv Detail & Related papers (2022-02-11T17:50:14Z)
Near-Optimal No-Regret Learning for Correlated Equilibria in Multi-Player General-Sum Games [104.74734408204749]
We show that if all agents in a multi-player general-sum normal-form game employ Optimistic Multiplicative Weights Update (OMWU), the external regret of every player is $O(textrmpolylog(T))$ after $T$ repetitions of the game. We extend their result from external regret to internal regret and swap regret, thereby establishing uncoupled learning dynamics that converge to an approximate correlated equilibrium.
arXiv Detail & Related papers (2021-11-11T01:19:53Z)
Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits [61.490254407420906]
We study the online restless bandit problem, where the state of each arm evolves according to a Markov chain. We propose Restless-UCB, a learning policy that follows the explore-then-commit framework.
arXiv Detail & Related papers (2020-11-05T05:16:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.