Smooth Non-Stationary Bandits
- URL: http://arxiv.org/abs/2301.12366v3
- Date: Sun, 17 Nov 2024 18:03:40 GMT
- Title: Smooth Non-Stationary Bandits
- Authors: Su Jia, Qian Xie, Nathan Kallus, Peter I. Frazier
- Abstract summary: We study a non-stationary bandits problem where each arm's mean reward sequence can be embedded into a $\beta$-H\"older function.
We show the first separation between the smooth (i.e., $\beta\ge 2$) and non-smooth (i.e., $\beta=1$) regimes by presenting a policy with $\tilde O(k^{4/5} T^{3/5})$ regret on any $k$-armed, $2$-H\"older instance.
- Abstract: In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time. However, in practice, environments often change {\em smoothly}, so such algorithms may incur higher-than-necessary regret. We study a non-stationary bandits problem where each arm's mean reward sequence can be embedded into a $\beta$-H\"older function, i.e., a function that is $(\beta-1)$-times Lipschitz-continuously differentiable. The non-stationarity becomes more smooth as $\beta$ increases. When $\beta=1$, this corresponds to the non-smooth regime, where \cite{besbes2014stochastic} established a minimax regret of $\tilde \Theta(T^{2/3})$. We show the first separation between the smooth (i.e., $\beta\ge 2$) and non-smooth (i.e., $\beta=1$) regimes by presenting a policy with $\tilde O(k^{4/5} T^{3/5})$ regret on any $k$-armed, $2$-H\"older instance. We complement this result by showing that the minimax regret on the $\beta$-H\"older family of instances is $\Omega(T^{(\beta+1)/(2\beta+1)})$ for any integer $\beta\ge 1$. This matches our upper bound for $\beta=2$ up to logarithmic factors. Furthermore, we validated the effectiveness of our policy through a comprehensive numerical study using real-world click-through rate data.
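As a quick sanity check on the rates quoted above (a worked evaluation of the abstract's stated bounds, not an additional claim from the paper), the lower-bound exponent $(\beta+1)/(2\beta+1)$ recovers both regimes:
- For $\beta=1$: $(\beta+1)/(2\beta+1) = 2/3$, recovering the non-smooth minimax rate $\tilde \Theta(T^{2/3})$ of \cite{besbes2014stochastic}.
- For $\beta=2$: $(\beta+1)/(2\beta+1) = 3/5$, matching the $T^{3/5}$ dependence of the $\tilde O(k^{4/5} T^{3/5})$ upper bound up to logarithmic factors.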
Related papers
- LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits [38.41164102066483]
This study considers the linear contextual bandit problem with independent and identically distributed contexts.
Our proposed algorithm is based on Follow-The-Regularized-Leader with Tsallis entropy and is referred to as the $\alpha$-Linear-Contextual (LC)-Tsallis-INF.
arXiv Detail & Related papers (2024-03-05T18:59:47Z) - Tight Memory-Regret Lower Bounds for Streaming Bandits [11.537938617281736]
The learner aims to minimize regret while dealing with online arriving arms and sublinear arm memory.
We establish the tight worst-case regret lower bound of $\Omega\left((TB)^{\alpha} K^{1-\alpha}\right)$, $\alpha = 2^{B}/(2^{B+1}-1)$, for any algorithm.
We also provide a multi-pass algorithm that achieves a regret upper bound of $\tilde O\left((TB)^{\alpha} K^{1-\alpha}\right)$ using constant arm memory.
arXiv Detail & Related papers (2023-06-13T16:54:13Z) - On the Minimax Regret for Online Learning with Feedback Graphs [5.721380617450645]
We improve on the upper and lower bounds for the regret of online learning with strongly observable undirected feedback graphs.
Our improved upper bound $\mathcal{O}\bigl(\sqrt{\alpha T(\ln K)/(\ln\alpha)}\bigr)$ holds for any $\alpha$ and matches the lower bounds for bandits and experts.
arXiv Detail & Related papers (2023-05-24T17:40:57Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde O(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Variance-Aware Sparse Linear Bandits [64.70681598741417]
Worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}\left(\sqrt{dT}\right)$.
In the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal{O}}(1)$ regret.
We develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits.
arXiv Detail & Related papers (2022-05-26T15:55:44Z) - Corralling a Larger Band of Bandits: A Case Study on Switching Regret
for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{dST})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only {\em logarithmic} dependence on $M$ as long as certain conditions are satisfied.
arXiv Detail & Related papers (2022-02-12T21:55:44Z) - Minimal Expected Regret in Linear Quadratic Control [79.81807680370677]
We devise an online learning algorithm and provide guarantees on its expected regret.
This regret at time $T$ is upper bounded (i) by $\widetilde{O}((d_u+d_x)\sqrt{d_x T})$ when $A$ and $B$ are unknown.
arXiv Detail & Related papers (2021-09-29T14:07:21Z) - Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z) - Smooth Bandit Optimization: Generalization to H\"older Space [37.15553727896912]
We consider bandit optimization of a smooth reward function, where the goal is to minimize cumulative regret.
Our main result is a generalization of the reward function class to the H\"older space with exponent $\alpha>1$.
We show that it achieves a regret rate that matches the existing lower bound for adaptation within the $\alpha\leq 1$ subset.
arXiv Detail & Related papers (2020-12-11T01:43:25Z)