ANACONDA: An Improved Dynamic Regret Algorithm for Adaptive
Non-Stationary Dueling Bandits
- URL: http://arxiv.org/abs/2210.14322v1
- Date: Tue, 25 Oct 2022 20:26:02 GMT
- Title: ANACONDA: An Improved Dynamic Regret Algorithm for Adaptive
Non-Stationary Dueling Bandits
- Authors: Thomas Kleine Buening and Aadirupa Saha
- Abstract summary: We study the problem of non-stationary dueling bandits and provide the first adaptive dynamic regret algorithm for this problem.
We show a near-optimal $tildeO(sqrtStexttCW T)$ dynamic regret bound, where $StexttCW$ is the number of times the Condorcet winner changes in $T$ rounds.
- Score: 20.128001589147512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of non-stationary dueling bandits and provide the first
adaptive dynamic regret algorithm for this problem. The only two existing
attempts in this line of work fall short across multiple dimensions, including
pessimistic measures of non-stationary complexity and non-adaptive parameter
tuning that requires knowledge of the number of preference changes. We develop
an elimination-based rescheduling algorithm to overcome these shortcomings and
show a near-optimal $\tilde{O}(\sqrt{S^{\texttt{CW}} T})$ dynamic regret bound,
where $S^{\texttt{CW}}$ is the number of times the Condorcet winner changes in
$T$ rounds. This yields the first near-optimal dynamic regret algorithm for
unknown $S^{\texttt{CW}}$. We further study other related notions of
non-stationarity for which we also prove near-optimal dynamic regret guarantees
under additional assumptions on the underlying preference model.
Related papers
- An Equivalence Between Static and Dynamic Regret Minimization [10.812831455376218]
We show that for linear losses, dynamic regret minimization is equivalent to static regret minimization in an extended decision space.
We provide an algorithm guaranteeing dynamic regret of the form $R_T(u_1,dots,u_T)le tilde.
arXiv Detail & Related papers (2024-06-03T17:54:58Z) - Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as the $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$textOFUL+$ and Restarted $textSAVE+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z) - Adaptive Regret for Bandits Made Possible: Two Queries Suffice [26.769372199571002]
We give query and regret optimal bandit algorithms under the strict notion of strongly adaptive regret.
Surprisingly, with just two queries per round, we give Strongly Adaptive Bandit Learner (StABL) that achieves $tildeO(sqrtn|I|)$ adaptive regret for multi-armed bandits.
arXiv Detail & Related papers (2024-01-17T15:32:04Z) - A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits [11.918230810566945]
We study the non-stationary multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning.
We propose a method that achieves, in $K$-armed bandit problems, a near-optimal $widetilde O(sqrtK N(S+1))$ dynamic regret.
arXiv Detail & Related papers (2022-01-17T17:23:56Z) - Adaptivity and Non-stationarity: Problem-dependent Dynamic Regret for Online Convex Optimization [70.4342220499858]
We introduce novel online algorithms that can exploit smoothness and replace the dependence on $T$ in dynamic regret with problem-dependent quantities.
Our results are adaptive to the intrinsic difficulty of the problem, since the bounds are tighter than existing results for easy problems and safeguard the same rate in the worst case.
arXiv Detail & Related papers (2021-12-29T02:42:59Z) - Optimal and Efficient Dynamic Regret Algorithms for Non-Stationary
Dueling Bandits [27.279654173896372]
We study the problem of emphdynamic regret minimization in $K$-armed Dueling Bandits under non-stationary or time varying preferences.
This is an online learning setup where the agent chooses a pair of items at each round and observes only a relative binary win-loss' feedback for this pair.
arXiv Detail & Related papers (2021-11-06T16:46:55Z) - Non-stationary Reinforcement Learning without Prior Knowledge: An
Optimal Black-box Approach [42.021871809877595]
We present a black-box reduction that turns a certain reinforcement learning algorithm with optimal regret in a near-stationary environment into another algorithm with optimal dynamic regret in a non-stationary environment.
We show that our approach significantly improves the state of the art for linear bandits, episodic MDPs, and infinite-horizon MDPs.
arXiv Detail & Related papers (2021-02-10T12:43:31Z) - Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $epsilon$.
arXiv Detail & Related papers (2020-08-10T08:30:52Z) - Efficient Optimistic Exploration in Linear-Quadratic Regulators via
Lagrangian Relaxation [107.06364966905821]
We study the exploration-exploitation dilemma in the linear quadratic regulator (LQR) setting.
Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of ofulq.
We show that an $epsilon$-optimistic controller can be computed efficiently by solving at most $Obig(log (1/epsilon)big)$ Riccati equations.
arXiv Detail & Related papers (2020-07-13T16:30:47Z) - Dynamic Regret of Convex and Smooth Functions [93.71361250701075]
We investigate online convex optimization in non-stationary environments.
We choose the dynamic regret as the performance measure.
We show that it is possible to further enhance the dynamic regret by exploiting the smoothness condition.
arXiv Detail & Related papers (2020-07-07T14:10:57Z) - Stochastic Linear Bandits Robust to Adversarial Attacks [117.665995707568]
We provide two variants of a Robust Phased Elimination algorithm, one that knows $C$ and one that does not.
We show that both variants attain near-optimal regret in the non-corrupted case $C = 0$, while incurring additional additive terms respectively.
In a contextual setting, we show that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing $C$.
arXiv Detail & Related papers (2020-07-07T09:00:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.