Smooth Non-Stationary Bandits
- URL: http://arxiv.org/abs/2301.12366v3
- Date: Sun, 17 Nov 2024 18:03:40 GMT
- Title: Smooth Non-Stationary Bandits
- Authors: Su Jia, Qian Xie, Nathan Kallus, Peter I. Frazier
- Abstract summary: We study a non-stationary bandits problem where each arm's mean reward sequence can be embedded into a $\beta$-H\"older function.
We show the first separation between the smooth (i.e., $\beta\ge 2$) and non-smooth (i.e., $\beta=1$) regimes by presenting a policy with $\tilde O(k^{4/5} T^{3/5})$ regret on any $k$-armed, $2$-H\"older instance.
- Abstract: In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time. However, in practice, environments often change {\em smoothly}, so such algorithms may incur higher-than-necessary regret. We study a non-stationary bandits problem where each arm's mean reward sequence can be embedded into a $\beta$-H\"older function, i.e., a function that is $(\beta-1)$-times Lipschitz-continuously differentiable. The non-stationarity becomes more smooth as $\beta$ increases. When $\beta=1$, this corresponds to the non-smooth regime, where \cite{besbes2014stochastic} established a minimax regret of $\tilde \Theta(T^{2/3})$. We show the first separation between the smooth (i.e., $\beta\ge 2$) and non-smooth (i.e., $\beta=1$) regimes by presenting a policy with $\tilde O(k^{4/5} T^{3/5})$ regret on any $k$-armed, $2$-H\"older instance. We complement this result by showing that the minimax regret on the $\beta$-H\"older family of instances is $\Omega(T^{(\beta+1)/(2\beta+1)})$ for any integer $\beta\ge 1$. This matches our upper bound for $\beta=2$ up to logarithmic factors. Furthermore, we validated the effectiveness of our policy through a comprehensive numerical study using real-world click-through rate data.
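As a quick sanity check on the rates quoted above (a worked evaluation of the abstract's stated bounds, not an additional claim from the paper), the lower-bound exponent $(\beta+1)/(2\beta+1)$ recovers both regimes:
- For $\beta=1$: $(\beta+1)/(2\beta+1) = 2/3$, recovering the non-smooth minimax rate $\tilde \Theta(T^{2/3})$ of \cite{besbes2014stochastic}.
- For $\beta=2$: $(\beta+1)/(2\beta+1) = 3/5$, matching the $T^{3/5}$ dependence of the $\tilde O(k^{4/5} T^{3/5})$ upper bound up to logarithmic factors.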
Related papers
- LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits [38.41164102066483]
This study considers the linear contextual bandit problem with independent and identically distributed contexts.
Our proposed algorithm is based on Follow-The-Regularized-Leader with Tsallis entropy and is referred to as the $\alpha$-Linear-Contextual (LC)-Tsallis-INF.
arXiv Detail & Related papers (2024-03-05T18:59:47Z) - Tight Memory-Regret Lower Bounds for Streaming Bandits [11.537938617281736]
The learner aims to minimize regret while dealing with online arriving arms and sublinear arm memory.
We establish the tight worst-case regret lower bound of $\Omega\left((TB)^{\alpha} K^{1-\alpha}\right)$, $\alpha = 2^{B}/(2^{B+1}-1)$, for any algorithm.
We also provide a multi-pass algorithm that achieves a regret upper bound of $\tilde O\left((TB)^{\alpha} K^{1-\alpha}\right)$ using constant arm memory.
arXiv Detail & Related papers (2023-06-13T16:54:13Z) - On the Minimax Regret for Online Learning with Feedback Graphs [5.721380617450645]
We improve on the upper and lower bounds for the regret of online learning with strongly observable undirected feedback graphs.
Our improved upper bound $\mathcal{O}\bigl(\sqrt{\alpha T(\ln K)/(\ln\alpha)}\bigr)$ holds for any $\alpha$ and matches the lower bounds for bandits and experts.
arXiv Detail & Related papers (2023-05-24T17:40:57Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde O(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Variance-Aware Sparse Linear Bandits [64.70681598741417]
Worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}\left(\sqrt{dT}\right)$.
In the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal{O}}(1)$ regret.
We develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits.
arXiv Detail & Related papers (2022-05-26T15:55:44Z) - Corralling a Larger Band of Bandits: A Case Study on Switching Regret
for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{dST})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only {\em logarithmic} dependence on $M$ as long as certain conditions are satisfied.
arXiv Detail & Related papers (2022-02-12T21:55:44Z) - Minimal Expected Regret in Linear Quadratic Control [79.81807680370677]
We devise an online learning algorithm and provide guarantees on its expected regret.
This regret at time $T$ is upper bounded (i) by $\widetilde{O}((d_u+d_x)\sqrt{d_x T})$ when $A$ and $B$ are unknown.
arXiv Detail & Related papers (2021-09-29T14:07:21Z) - Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z) - Smooth Bandit Optimization: Generalization to H\"older Space [37.15553727896912]
We consider bandit optimization of a smooth reward function, where the goal is to minimize cumulative regret.
Our main result is a generalization of the reward function class to the H\"older space with exponent $\alpha>1$.
We show that it achieves a regret rate that matches the existing lower bound for adaptation within the $\alpha\leq 1$ subset.
arXiv Detail & Related papers (2020-12-11T01:43:25Z)