No Discounted-Regret Learning in Adversarial Bandits with Delays
- URL: http://arxiv.org/abs/2103.04550v1
- Date: Mon, 8 Mar 2021 05:15:31 GMT
- Title: No Discounted-Regret Learning in Adversarial Bandits with Delays
- Authors: Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, Jose Blanchet
- Abstract summary: We show that the expected discounted ergodic distribution of play converges to the set of coarse correlated equilibrium (CCE) if the algorithms have "no discounted-regret".
For a zero-sum game, we show that no discounted-regret is sufficient for the discounted ergodic average of play to converge to the set of Nash equilibria.
- Score: 40.670563413892154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consider a player that in each round $t$ out of $T$ rounds chooses an action
and observes the incurred cost after a delay of $d_{t}$ rounds. The cost
functions and the delay sequence are chosen by an adversary. We show that even
if the players' algorithms lose their "no regret" property because the delays
are too large, the expected discounted ergodic distribution of play converges to the
set of coarse correlated equilibrium (CCE) if the algorithms have "no
discounted-regret". For a zero-sum game, we show that no discounted-regret is
sufficient for the discounted ergodic average of play to converge to the set of
Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves a
regret of
$O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and the
EXP3 algorithm with $K$ arms achieves a regret of $O\left(\sqrt{\ln
K\left(KT+D\right)}\right)$ even when $D=\sum_{t=1}^{T}d_{t}$ and $T$ are
unknown. These bounds use a novel doubling trick that provably retains the
regret bound of the case where $D$ and $T$ are known. Using these bounds, we show that
EXP3 and FKM have no discounted-regret even for $d_{t}=O\left(t\log t\right)$.
Therefore, the CCE of a finite or convex unknown game can be approximated via
simulation even when only delayed bandit feedback is available.
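To make the central notion concrete: one natural formalization (the paper's exact weighting may differ) attaches a weight $\gamma_{t}$ to round $t$ and defines the discounted regret as $\sum_{t=1}^{T}\gamma_{t}\ell_{t}\left(a_{t}\right)-\min_{a}\sum_{t=1}^{T}\gamma_{t}\ell_{t}\left(a\right)$; "no discounted-regret" then asks that this quantity be $o\left(\sum_{t=1}^{T}\gamma_{t}\right)$.

The sketch below is a minimal Python illustration of EXP3 under the delayed bandit feedback model of the abstract, not the authors' implementation. The function name `exp3_delayed`, the fixed learning rate `eta`, and the loss/delay interfaces are hypothetical, and the paper's doubling trick (restarting with an updated guess of $KT+D$ as the observed feedback volume grows) is omitted for brevity.

```python
import math
import random
from collections import defaultdict

def exp3_delayed(losses, delays, K, eta):
    """EXP3 with delayed bandit feedback (illustrative sketch).

    losses[t][a]: adversarial loss of arm a at round t, in [0, 1].
    delays[t]:    the cost incurred at round t is observed at round t + delays[t].
    """
    cum_loss = [0.0] * K         # importance-weighted cumulative loss estimates
    pending = defaultdict(list)  # arrival round -> [(arm, loss, prob)]
    played = []
    for t in range(len(losses)):
        # Exponential-weights distribution, shifted for numerical stability.
        m = min(cum_loss)
        expw = [math.exp(-eta * (c - m)) for c in cum_loss]
        z = sum(expw)
        probs = [w / z for w in expw]
        arm = random.choices(range(K), weights=probs)[0]
        played.append(arm)
        # The incurred cost is queued and only revealed after delays[t] rounds;
        # feedback that would arrive after the horizon is simply never used.
        pending[t + delays[t]].append((arm, losses[t][arm], probs[arm]))
        # Apply all feedback arriving now, using the importance weight from
        # the round at which the arm was actually played.
        for a, loss, p in pending.pop(t, []):
            cum_loss[a] += loss / p
    return played
```

With a suitably tuned $\eta$, updates of this form are the kind of scheme behind bounds such as $O\left(\sqrt{\ln K\left(KT+D\right)}\right)$; the doubling trick is what removes the need to know $T$ and $D$ when tuning.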
Related papers
- Full Swap Regret and Discretized Calibration [18.944031222413294]
We study the problem of minimizing swap regret in structured normal-form games.
We introduce a new online learning problem we call \emph{full swap regret} minimization.
We also apply these tools to the problem of online forecasting to minimize calibration error.
arXiv Detail & Related papers (2025-02-13T13:49:52Z)
- p-Mean Regret for Stochastic Bandits [52.828710025519996]
We introduce a simple, unified UCB-based algorithm that achieves novel $p$-mean regret bounds.
Our framework encompasses both average cumulative regret and Nash regret as special cases.
arXiv Detail & Related papers (2024-12-14T08:38:26Z)
- Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
We study bandit convex optimization (BCO) with delayed feedback, where only the loss value of the action is revealed after a delay.
We develop a novel algorithm, and prove that it enjoys a regret bound of $O\left(\sqrt{n}T^{3/4}+\sqrt{dT}\right)$ in general.
We show that the proposed algorithm can improve the regret bound to $O\left((nT)^{2/3}\log^{1/3}T+d\log T\right)$ for strongly convex functions.
arXiv Detail & Related papers (2024-02-14T13:08:26Z)
- A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback [25.68113242132723]
We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback.
It simultaneously achieves a near-optimal adversarial regret guarantee in the setting with fixed delays.
We also present an extension of the algorithm to the case of arbitrary delays.
arXiv Detail & Related papers (2022-06-29T20:49:45Z)
- Better Best of Both Worlds Bounds for Bandits with Switching Costs [37.71741656687868]
We study best-of-both-worlds algorithms for bandits with switching cost, recently addressed by Rouyer, Seldin, and Cesa-Bianchi (2021).
We introduce a surprisingly simple and effective algorithm that simultaneously achieves the minimax optimal regret bound of $\mathcal{O}(T^{2/3})$ in the oblivious adversarial setting.
arXiv Detail & Related papers (2022-06-07T08:22:56Z)
- Bounded Memory Adversarial Bandits with Composite Anonymous Delayed Feedback [10.179145161000315]
We study the adversarial bandit problem with composite anonymous delayed feedback.
In this setting, losses of an action are split into $d$ components, spreading over consecutive rounds after the action is chosen.
We show that the non-oblivious setting incurs $\Omega(T)$ pseudo-regret even when the loss sequence has bounded memory.
arXiv Detail & Related papers (2022-04-27T08:32:35Z)
- Uncoupled Learning Dynamics with $O(\log T)$ Swap Regret in Multiplayer Games [121.50979258049135]
We show that when all players follow our dynamics with a \emph{time-invariant} learning rate, the \emph{second-order path lengths} of the dynamics up to time $T$ are bounded by $O(\log T)$.
Our proposed learning dynamics combine in a novel way \emph{optimistic regularized learning} with the use of \emph{self-concordant barriers}.
arXiv Detail & Related papers (2022-04-25T03:20:16Z)
- Adversarial Dueling Bandits [85.14061196945599]
We introduce the problem of regret minimization in Adversarial Dueling Bandits.
The learner has to repeatedly choose a pair of items and observe only a relative binary "win-loss" feedback for this pair.
Our main result bounds the algorithm's $T$-round regret compared to the \emph{Borda-winner} from a set of $K$ items.
arXiv Detail & Related papers (2020-10-27T19:09:08Z)
- Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}\left(T\sqrt{kN} + \sqrt{dkNT}\right)$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z)