Bounded Memory Adversarial Bandits with Composite Anonymous Delayed
Feedback
- URL: http://arxiv.org/abs/2204.12764v2
- Date: Thu, 28 Apr 2022 02:28:34 GMT
- Title: Bounded Memory Adversarial Bandits with Composite Anonymous Delayed
Feedback
- Authors: Zongqi Wan, Xiaoming Sun, Jialin Zhang
- Abstract summary: We study the adversarial bandit problem with composite anonymous delayed feedback.
In this setting, losses of an action are split into $d$ components, spreading over consecutive rounds after the action is chosen.
We show that the non-oblivious setting incurs $\Omega(T)$ pseudo regret even when the loss sequence has bounded memory.
- Score: 10.179145161000315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the adversarial bandit problem with composite anonymous delayed
feedback. In this setting, losses of an action are split into $d$ components,
spreading over consecutive rounds after the action is chosen. In each round, the
algorithm observes only the aggregation of the loss components that come from the
latest $d$ rounds. Previous works focus on the oblivious adversarial setting, while
we investigate the harder non-oblivious setting. We show that the non-oblivious
setting incurs $\Omega(T)$ pseudo regret even when the loss sequence has bounded memory.
However, we propose a wrapper algorithm that enjoys $o(T)$ policy regret on
many adversarial bandit problems under the assumption that the loss sequence has
bounded memory. In particular, for the $K$-armed bandit and bandit convex
optimization we obtain an $\mathcal{O}(T^{2/3})$ policy regret bound. We also prove
a matching lower bound for the $K$-armed bandit. Our lower bound holds even when
the loss sequence is oblivious but the delay is non-oblivious; it answers the
open problem posed in \cite{wang2021adaptive}, showing that non-oblivious
delay alone is enough to incur $\tilde{\Omega}(T^{2/3})$ regret.
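
To make the feedback model concrete, here is a minimal simulation sketch (illustrative only, not code from the paper; the random arm choice, the Dirichlet loss split, and the toy parameters are all assumptions): each round's loss for the chosen action is split into $d$ components, and the learner observes only the sum of the components that happen to land on the current round.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 5, 4, 20           # arms, delay span, horizon (toy values)

# pending[s] = loss mass that will be observed s rounds from now
pending = np.zeros(d)

for t in range(T):
    arm = rng.integers(K)    # stand-in for the learner's choice

    # The adversary splits this round's loss for `arm` into d non-negative
    # components; component s is suffered at round t + s.
    components = rng.dirichlet(np.ones(d)) * rng.random()
    pending += components

    # The learner observes only the aggregated loss arriving this round,
    # with no attribution to the past actions that generated it.
    observed = pending[0]
    pending = np.roll(pending, -1)
    pending[-1] = 0.0
    print(f"round {t}: arm {arm}, observed aggregated loss {observed:.3f}")
```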
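The abstract does not spell out the wrapper construction, so the sketch below only illustrates one standard way an $\mathcal{O}(T^{2/3})$ policy regret bound can arise: run a base bandit algorithm over blocks of consecutive rounds, committing to a single arm per block so that the delayed loss components are mostly absorbed inside the block. The bare-bones EXP3 base learner, the block length of roughly $T^{1/3}$, and the `loss_fn` interface are all assumptions for illustration, not the authors' algorithm; the block length trades the base learner's regret over $T/B$ blocks against the adaptation loss a bounded-memory adversary can force at each block boundary.

```python
import numpy as np

def exp3_blocked(T, K, d, loss_fn, rng=None):
    """Hedged sketch of a blocking ('mini-batch') wrapper around EXP3.

    loss_fn(t, arm) stands in for the environment and returns the aggregated
    loss observed at round t (as in the simulation above).  This is an
    illustrative recipe, not the paper's exact wrapper.
    """
    rng = rng or np.random.default_rng()
    B = max(d + 1, int(round(T ** (1 / 3))))         # block length (assumed choice)
    n_blocks = max(1, T // B)
    eta = np.sqrt(2.0 * np.log(K) / (n_blocks * K))  # EXP3 learning rate
    weights = np.ones(K)
    total_loss = 0.0

    for b in range(n_blocks):
        probs = weights / weights.sum()
        arm = rng.choice(K, p=probs)

        # Commit to `arm` for the whole block; delayed components of earlier
        # rounds in the block land inside the same block.
        block_loss = sum(loss_fn(b * B + s, arm) for s in range(B))
        total_loss += block_loss

        # Importance-weighted EXP3 update on the per-block average loss.
        est = np.zeros(K)
        est[arm] = (block_loss / B) / probs[arm]
        weights *= np.exp(-eta * est)

    return total_loss
```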
Related papers
- Adversarial Combinatorial Bandits with Switching Costs [55.2480439325792]
We study the adversarial bandit problem with a switching cost $\lambda$ incurred each time the selected arm changes between rounds.
We derive lower bounds for the minimax regret and design algorithms to approach them.
arXiv Detail & Related papers (2024-04-02T12:15:37Z) - Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
We study bandit convex optimization (BCO) with delayed feedback, where only the loss value of the chosen action is revealed, after a delay.
We develop a novel algorithm and prove that it enjoys a regret bound of $O(\sqrt{n}T^{3/4}+\sqrt{dT})$ in general.
We show that the proposed algorithm can improve the regret bound to $O((nT)^{2/3}\log^{1/3}T+d\log T)$ for strongly convex functions.
arXiv Detail & Related papers (2024-02-14T13:08:26Z) - Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well-established notion for measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the complete policy regret.
We provide an algorithm that w.h.p. achieves a complete policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the $\tilde{\mathcal{O}}$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z) - Corralling a Larger Band of Bandits: A Case Study on Switching Regret
for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$ as long as ...
arXiv Detail & Related papers (2022-02-12T21:55:44Z) - Non-Stationary Dueling Bandits [8.298716599039501]
We study the non-stationary dueling bandits problem with $K$ arms, where the time horizon $T$ consists of $M$ stationary segments.
To minimize the accumulated regret, the learner needs to pick the Condorcet winner of each stationary segment as often as possible.
We show a regret bound for the non-stationary case, without requiring knowledge of $M$ or $T$.
arXiv Detail & Related papers (2022-02-02T09:57:35Z) - No Discounted-Regret Learning in Adversarial Bandits with Delays [40.670563413892154]
We show that the expected discounted ergodic distribution of play converges to the set of coarse correlated equilibria (CCE) if the algorithms have "no discounted-regret".
For a zero-sum game, we show that no discounted-regret is sufficient for the discounted ergodic average of play to converge to the set of Nash equilibria.
arXiv Detail & Related papers (2021-03-08T05:15:31Z) - Adversarial Dueling Bandits [85.14061196945599]
We introduce the problem of regret minimization in Adversarial Dueling Bandits.
The learner has to repeatedly choose a pair of items and observe only relative binary win-loss feedback for this pair.
Our main result is a bound on an algorithm's $T$-round regret compared to the Borda-winner from a set of $K$ items.
arXiv Detail & Related papers (2020-10-27T19:09:08Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)