No Discounted-Regret Learning in Adversarial Bandits with Delays
- URL: http://arxiv.org/abs/2103.04550v1
- Date: Mon, 8 Mar 2021 05:15:31 GMT
- Title: No Discounted-Regret Learning in Adversarial Bandits with Delays
- Authors: Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, Jose Blanchet
- Abstract summary: We show that the expected discounted ergodic distribution of play converges to the set of coarse correlated equilibrium (CCE) if the algorithms have "no discounted-regret".
For a zero-sum game, we show that no discounted-regret is sufficient for the discounted ergodic average of play to converge to the set of Nash equilibria.
- Score: 40.670563413892154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consider a player that in each round $t$ out of $T$ rounds chooses an action
and observes the incurred cost after a delay of $d_{t}$ rounds. The cost
functions and the delay sequence are chosen by an adversary. We show that even
if the players' algorithms lose their "no regret" property because the delays
are too large, the expected discounted ergodic distribution of play converges to the
set of coarse correlated equilibrium (CCE) if the algorithms have "no
discounted-regret". For a zero-sum game, we show that no discounted-regret is
sufficient for the discounted ergodic average of play to converge to the set of
Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves a
regret of
$O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and the
EXP3 algorithm with $K$ arms achieves a regret of $O\left(\sqrt{\ln
K\left(KT+D\right)}\right)$ even when $D=\sum_{t=1}^{T}d_{t}$ and $T$ are
unknown. These bounds use a novel doubling trick that provably retains the
regret bound of the case where $D$ and $T$ are known. Using these bounds, we show that
EXP3 and FKM have no discounted-regret even for $d_{t}=O\left(t\log t\right)$.
Therefore, the CCE of a finite or convex unknown game can be approximated via
simulation even when only delayed bandit feedback is available.
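To make the central notion concrete: one natural formalization (the paper's exact weighting may differ) attaches a weight $\gamma_{t}$ to round $t$ and defines the discounted regret as $\sum_{t=1}^{T}\gamma_{t}\ell_{t}\left(a_{t}\right)-\min_{a}\sum_{t=1}^{T}\gamma_{t}\ell_{t}\left(a\right)$; "no discounted-regret" then asks that this quantity be $o\left(\sum_{t=1}^{T}\gamma_{t}\right)$.

The sketch below is a minimal Python illustration of EXP3 under the delayed bandit feedback model of the abstract, not the authors' implementation. The function name `exp3_delayed`, the fixed learning rate `eta`, and the loss/delay interfaces are hypothetical, and the paper's doubling trick (restarting with an updated guess of $KT+D$ as the observed feedback volume grows) is omitted for brevity.

```python
import math
import random
from collections import defaultdict

def exp3_delayed(losses, delays, K, eta):
    """EXP3 with delayed bandit feedback (illustrative sketch).

    losses[t][a]: adversarial loss of arm a at round t, in [0, 1].
    delays[t]:    the cost incurred at round t is observed at round t + delays[t].
    """
    cum_loss = [0.0] * K         # importance-weighted cumulative loss estimates
    pending = defaultdict(list)  # arrival round -> [(arm, loss, prob)]
    played = []
    for t in range(len(losses)):
        # Exponential-weights distribution, shifted for numerical stability.
        m = min(cum_loss)
        expw = [math.exp(-eta * (c - m)) for c in cum_loss]
        z = sum(expw)
        probs = [w / z for w in expw]
        arm = random.choices(range(K), weights=probs)[0]
        played.append(arm)
        # The incurred cost is queued and only revealed after delays[t] rounds;
        # feedback that would arrive after the horizon is simply never used.
        pending[t + delays[t]].append((arm, losses[t][arm], probs[arm]))
        # Apply all feedback arriving now, using the importance weight from
        # the round at which the arm was actually played.
        for a, loss, p in pending.pop(t, []):
            cum_loss[a] += loss / p
    return played
```

With a suitably tuned $\eta$, updates of this form are the kind of scheme behind bounds such as $O\left(\sqrt{\ln K\left(KT+D\right)}\right)$; the doubling trick is what removes the need to know $T$ and $D$ when tuning.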
Related papers
- Full Swap Regret and Discretized Calibration [18.944031222413294]
We study the problem of minimizing swap regret in structured normal-form games.
We introduce a new online learning problem we call \emph{full swap regret} minimization.
We also apply these tools to the problem of online forecasting to minimize calibration error.
arXiv Detail & Related papers (2025-02-13T13:49:52Z)
- p-Mean Regret for Stochastic Bandits [52.828710025519996]
We introduce a simple, unified UCB-based algorithm that achieves novel $p$-mean regret bounds.
Our framework encompasses both average cumulative regret and Nash regret as special cases.
arXiv Detail & Related papers (2024-12-14T08:38:26Z)
- Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
We study bandit convex optimization (BCO) with delayed feedback, where only the loss value of the action is revealed after a delay.
We develop a novel algorithm, and prove that it enjoys a regret bound of $O\left(\sqrt{n}T^{3/4}+\sqrt{dT}\right)$ in general.
We show that the proposed algorithm can improve the regret bound to $O\left((nT)^{2/3}\log^{1/3}T+d\log T\right)$ for strongly convex functions.
arXiv Detail & Related papers (2024-02-14T13:08:26Z)
- A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback [25.68113242132723]
We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback.
It simultaneously achieves a near-optimal adversarial regret guarantee in the setting with fixed delays.
We also present an extension of the algorithm to the case of arbitrary delays.
arXiv Detail & Related papers (2022-06-29T20:49:45Z)
- Better Best of Both Worlds Bounds for Bandits with Switching Costs [37.71741656687868]
We study best-of-both-worlds algorithms for bandits with switching cost, recently addressed by Rouyer, Seldin, and Cesa-Bianchi (2021).
We introduce a surprisingly simple and effective algorithm that simultaneously achieves the minimax optimal regret bound of $\mathcal{O}(T^{2/3})$ in the oblivious adversarial setting.
arXiv Detail & Related papers (2022-06-07T08:22:56Z)
- Bounded Memory Adversarial Bandits with Composite Anonymous Delayed Feedback [10.179145161000315]
We study the adversarial bandit problem with composite anonymous delayed feedback.
In this setting, losses of an action are split into $d$ components, spreading over consecutive rounds after the action is chosen.
We show that the non-oblivious setting incurs $\Omega(T)$ pseudo-regret even when the loss sequence has bounded memory.
arXiv Detail & Related papers (2022-04-27T08:32:35Z)
- Uncoupled Learning Dynamics with $O(\log T)$ Swap Regret in Multiplayer Games [121.50979258049135]
We show that when all players follow our dynamics with a \emph{time-invariant} learning rate, the \emph{second-order path lengths} of the dynamics up to time $T$ are bounded by $O(\log T)$.
Our proposed learning dynamics combine in a novel way \emph{optimistic regularized learning} with the use of \emph{self-concordant barriers}.
arXiv Detail & Related papers (2022-04-25T03:20:16Z)
- Adversarial Dueling Bandits [85.14061196945599]
We introduce the problem of regret minimization in Adversarial Dueling Bandits.
The learner has to repeatedly choose a pair of items and observe only a relative binary "win-loss" feedback for this pair.
Our main result bounds the algorithm's $T$-round regret compared to the \emph{Borda-winner} from a set of $K$ items.
arXiv Detail & Related papers (2020-10-27T19:09:08Z)
- Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}\left(T\sqrt{kN} + \sqrt{dkNT}\right)$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z)