Adaptive Regret for Bandits Made Possible: Two Queries Suffice
        - URL: http://arxiv.org/abs/2401.09278v1
 - Date: Wed, 17 Jan 2024 15:32:04 GMT
 - Title: Adaptive Regret for Bandits Made Possible: Two Queries Suffice
 - Authors: Zhou Lu, Qiuyi Zhang, Xinyi Chen, Fred Zhang, David Woodruff, Elad
  Hazan
 - Abstract summary: We give query and regret optimal bandit algorithms under the strict notion of strongly adaptive regret.
Surprisingly, with just two queries per round, we give Strongly Adaptive Bandit Learner (StABL) that achieves $\tilde{O}(\sqrt{n|I|})$ adaptive regret for multi-armed bandits.
 - Score: 26.769372199571002
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract:   Fast changing states or volatile environments pose a significant challenge to
online optimization, which needs to perform rapid adaptation under limited
observation. In this paper, we give query and regret optimal bandit algorithms
under the strict notion of strongly adaptive regret, which measures the maximum
regret over any contiguous interval $I$. Due to its worst-case nature, there is
an almost-linear $\Omega(|I|^{1-\epsilon})$ regret lower bound, when only one
query per round is allowed [Daniely et al., ICML 2015]. Surprisingly, with just
two queries per round, we give Strongly Adaptive Bandit Learner (StABL) that
achieves $\tilde{O}(\sqrt{n|I|})$ adaptive regret for multi-armed bandits with
$n$ arms. The bound is tight and cannot be improved in general. Our algorithm
leverages a multiplicative update scheme of varying stepsizes and a carefully
chosen observation distribution to control the variance. Furthermore, we extend
our results and provide optimal algorithms in the bandit convex optimization
setting. Finally, we empirically demonstrate the superior performance of our
algorithms under volatile environments and for downstream tasks, such as
algorithm selection for hyperparameter optimization.
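
The abstract sketches the algorithmic ingredients: a family of interval experts combined by a multiplicative update with interval-dependent step sizes, plus a second per-round query drawn from a separate observation distribution that keeps the variance of importance-weighted loss estimates under control (strongly adaptive regret being, roughly, $\max_{I \subseteq [T]} \big( \sum_{t \in I} \ell_t(a_t) - \min_a \sum_{t \in I} \ell_t(a) \big)$). The Python below is a minimal toy sketch of that recipe under assumptions of our own (a geometric interval cover, EXP3-style estimates, a uniform observation distribution, ad-hoc step sizes); it is not the authors' StABL procedure and the constants are purely illustrative.

```python
# Toy sketch of a two-query, strongly adaptive bandit learner.
# NOT the StABL algorithm from the paper: interval cover, step sizes, and the
# observation distribution below are assumptions chosen for readability.
import numpy as np

rng = np.random.default_rng(0)

def geometric_intervals(horizon):
    """Geometric covering: intervals [s, s + 2^k) starting at multiples of 2^k."""
    intervals = []
    k = 0
    while 2 ** k <= horizon:
        length = 2 ** k
        for start in range(0, horizon, length):
            intervals.append((start, min(start + length, horizon)))
        k += 1
    return intervals

def two_query_sketch(losses):
    """losses: (T, n) array of per-round, per-arm losses in [0, 1]."""
    T, n = losses.shape
    intervals = geometric_intervals(T)
    # One multiplicative-weights expert per interval, step size ~ sqrt(log n / (n |I|)).
    expert_w = [np.ones(n) for _ in intervals]
    etas = [np.sqrt(np.log(n) / (n * (e - s))) for (s, e) in intervals]
    meta_w = np.ones(len(intervals))            # meta weights over experts
    meta_eta = np.sqrt(np.log(len(intervals)) / T)
    total_loss = 0.0
    for t in range(T):
        active = [i for i, (s, e) in enumerate(intervals) if s <= t < e]
        # Combine the active experts' arm distributions with the meta weights.
        meta = meta_w[active] / meta_w[active].sum()
        probs = np.zeros(n)
        for m, i in zip(meta, active):
            probs += m * expert_w[i] / expert_w[i].sum()
        probs = probs / probs.sum()
        play = rng.choice(n, p=probs)           # query 1: the arm we actually play
        # Query 2: an extra observation; here uniform, whereas the paper chooses
        # this distribution carefully to control the estimator's variance.
        obs_dist = np.full(n, 1.0 / n)
        observe = rng.choice(n, p=obs_dist)
        total_loss += losses[t, play]
        # Importance-weighted loss estimate built from the observation query only.
        est = np.zeros(n)
        est[observe] = losses[t, observe] / obs_dist[observe]
        for i in active:                        # per-expert multiplicative update
            expert_w[i] *= np.exp(-etas[i] * est)
            p_i = expert_w[i] / expert_w[i].sum()
            meta_w[i] *= np.exp(-meta_eta * (p_i @ est))
    return total_loss

# Example: 3 arms over 64 rounds, with the best arm switching halfway through.
L = np.ones((64, 3)) * 0.9
L[:32, 0] = 0.1
L[32:, 2] = 0.1
print("cumulative loss:", two_query_sketch(L))
```

In the toy example the best arm switches halfway through the horizon, which is exactly the kind of volatility that adaptive (rather than standard) regret is meant to capture: a learner with small regret on every contiguous interval must track the new best arm quickly after the switch.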
 
       
      
        Related papers
        - Second Order Methods for Bandit Optimization and Control [34.51425758864638]
We show that our algorithm achieves optimal regret bounds for a large class of convex functions that we call $\kappa$-convex.
We also investigate the adaptation of our second-order bandit algorithm to online convex optimization with memory.
arXiv  Detail & Related papers  (2024-02-14T04:03:38Z) - Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel Stochastic-Greedy Bandit (SGB) algorithm for multi-armed bandit problems where no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv  Detail & Related papers  (2023-12-13T11:08:25Z) - Best-of-Three-Worlds Linear Bandit Algorithm with Variance-Adaptive
  Regret Bounds [27.92755687977196]
The paper proposes a linear bandit algorithm that is adaptive to environments at two different levels of hierarchy.
At the higher level, the proposed algorithm adapts to a variety of types of environments.
The proposed algorithm has data-dependent regret bounds that depend on the cumulative loss of the optimal action.
arXiv  Detail & Related papers  (2023-02-24T00:02:24Z) - Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement
  Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d \sqrt{\sum_{k=1}^{K} \sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv  Detail & Related papers  (2023-02-21T00:17:24Z) - ANACONDA: An Improved Dynamic Regret Algorithm for Adaptive
  Non-Stationary Dueling Bandits [20.128001589147512]
We study the problem of non-stationary dueling bandits and provide the first adaptive dynamic regret algorithm for this problem.
We show a near-optimal $\tilde{O}(\sqrt{S_{\text{CW}} T})$ dynamic regret bound, where $S_{\text{CW}}$ is the number of times the Condorcet winner changes in $T$ rounds.
arXiv  Detail & Related papers  (2022-10-25T20:26:02Z) - Efficient and Optimal Algorithms for Contextual Dueling Bandits under
  Realizability [59.81339109121384]
We study the $K$ contextual dueling bandit problem, a sequential decision making setting in which the learner uses contextual information to make two decisions, but only observes preference-based feedback suggesting that one decision was better than the other.
We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.
arXiv  Detail & Related papers  (2021-11-24T07:14:57Z) - Minimax Optimization with Smooth Algorithmic Adversaries [59.47122537182611]
We propose a new algorithm for the min-player against smooth algorithms deployed by an adversary.
Our algorithm is guaranteed to make monotonic progress (having no limit cycles) and to converge within an appropriate number of gradient ascent steps.
arXiv  Detail & Related papers  (2021-06-02T22:03:36Z) - Lazy OCO: Online Convex Optimization on a Switching Budget [34.936641201844054]
We study a variant of online convex optimization where the player is permitted to switch decisions at most $S$ times in expectation throughout $T$ rounds.
Similar problems have been addressed in prior work for the discrete decision set setting, and more recently in the continuous setting but only with an adaptive adversary.
arXiv  Detail & Related papers  (2021-02-07T14:47:19Z) - Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
arXiv  Detail & Related papers  (2020-08-10T08:30:52Z) - Efficient Optimistic Exploration in Linear-Quadratic Regulators via
  Lagrangian Relaxation [107.06364966905821]
We study the exploration-exploitation dilemma in the linear quadratic regulator (LQR) setting.
Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of OFU-LQ.
We show that an $\epsilon$-optimistic controller can be computed efficiently by solving at most $O\big(\log(1/\epsilon)\big)$ Riccati equations.
arXiv  Detail & Related papers  (2020-07-13T16:30:47Z) - Stochastic Linear Bandits Robust to Adversarial Attacks [117.665995707568]
We provide two variants of a Robust Phased Elimination algorithm, one that knows $C$ and one that does not.
We show that both variants attain near-optimal regret in the non-corrupted case $C = 0$, while each incurs an additional additive term depending on the corruption level $C$.
In a contextual setting, we show that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing $C$.
arXiv  Detail & Related papers  (2020-07-07T09:00:57Z) 
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     