Nonstochastic Bandits with Composite Anonymous Feedback
- URL: http://arxiv.org/abs/2112.02866v1
- Date: Mon, 6 Dec 2021 08:44:04 GMT
- Title: Nonstochastic Bandits with Composite Anonymous Feedback
- Authors: Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni, Claudio Gentile, Yishay Mansour
- Abstract summary: We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player.
The instantaneous loss observed by the player at the end of each round is then a sum of many loss components of previously played actions.
- Score: 41.38921728211769
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate a nonstochastic bandit setting in which the loss of an action
is not immediately charged to the player, but rather spread over the subsequent
rounds in an adversarial way. The instantaneous loss observed by the player at
the end of each round is then a sum of many loss components of previously
played actions. This setting encompasses as a special case the easier task of
bandits with delayed feedback, a well-studied framework where the player
observes the delayed losses individually.
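For concreteness, the following is a minimal simulation sketch of this feedback structure (all names are illustrative, and the random split stands in for what is, in the paper, an adversarial choice): each arm's per-round loss is divided into $d+1$ components charged over the next $d$ rounds, and the player only observes their aggregate.

```python
import numpy as np

# Minimal sketch of composite anonymous feedback. The split of each loss into
# d+1 components is drawn at random here purely for illustration; in the
# setting of the paper it is chosen by an adversary.
rng = np.random.default_rng(0)
K, T, d = 3, 1000, 5                      # arms, horizon, maximum delay

base_losses = rng.random((T, K))          # total loss of each arm at each round, in [0, 1]
# splits[t, a] lies on the (d+1)-simplex: how arm a's round-t loss is spread
# over rounds t, t+1, ..., t+d
splits = rng.dirichlet(np.ones(d + 1), size=(T, K))
components = base_losses[:, :, None] * splits   # components[t, a, s] is charged at round t+s

def observed_loss(t, actions):
    """Anonymous aggregate feedback at the end of round t: one scalar mixing
    components of the arms played in rounds t-d, ..., t, with no per-round
    attribution. actions[s] is the arm played at round s."""
    return sum(
        components[t - s, actions[t - s], s]
        for s in range(min(d, t) + 1)
    )
```

With $d = 0$ this is standard bandit feedback, and observing the delayed components individually, rather than only their sum, would recover the easier delayed-feedback setting mentioned above.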
Our first contribution is a general reduction transforming a standard bandit
algorithm into one that can operate in the harder setting: We bound the regret
of the transformed algorithm in terms of the stability and regret of the
original algorithm. Then, we show that the transformation of a suitably tuned
FTRL with Tsallis entropy has a regret of order $\sqrt{(d+1)KT}$, where $d$ is
the maximum delay, $K$ is the number of arms, and $T$ is the time horizon.
Finally, we show that our results cannot be improved in general by exhibiting a
matching (up to a log factor) lower bound on the regret of any algorithm
operating in this setting.
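As a rough sketch of the kind of base learner referred to above, the snippet below shows a standard FTRL step with a 1/2-Tsallis-entropy regularizer together with the usual importance-weighted loss estimate; the specific tuning and the wrapping reduction from the paper are not reproduced, and all names are illustrative.

```python
import numpy as np

def tsallis_ftrl_distribution(L_hat, eta, tol=1e-10):
    """FTRL step with regularizer -(1/eta) * sum_i sqrt(p_i) over the simplex.
    The first-order conditions give p_i = 1 / (4 * (eta * L_hat[i] + lam)^2),
    with the Lagrange multiplier lam chosen so that the p_i sum to one."""
    def total(lam):
        return np.sum(1.0 / (4.0 * (eta * L_hat + lam) ** 2))

    lo = -eta * np.min(L_hat) + 1e-12   # lam must exceed this for every p_i to be positive
    span = 1.0
    while total(lo + span) > 1.0:       # grow the bracket until total(hi) <= 1
        span *= 2.0
    hi = lo + span
    while hi - lo > tol:                # binary search: total(lam) is decreasing in lam
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    p = 1.0 / (4.0 * (eta * L_hat + hi) ** 2)
    return p / p.sum()                  # normalize away residual numerical error

def importance_weighted_update(L_hat, arm, loss, p):
    """Standard unbiased loss estimate used by Exp3/FTRL-style bandit methods."""
    L_hat = L_hat.copy()
    L_hat[arm] += loss / p[arm]
    return L_hat
```

A plain bandit round would draw an arm from `tsallis_ftrl_distribution(L_hat, eta)`, observe its loss, and update `L_hat` with `importance_weighted_update`; the paper's contribution is the reduction that wraps such a base algorithm so it can operate on composite anonymous feedback instead, which is not shown here.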
Related papers
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z) - Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear
Contextual Bandits and Markov Decision Processes [59.61248760134937]
We propose an efficient algorithm to achieve a regret of $\tilde{O}(\sqrt{T}+\zeta)$.
The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandits.
We generalize our algorithm to the episodic MDP setting and are the first to achieve an additive dependence on the corruption level $\zeta$.
arXiv Detail & Related papers (2022-12-12T15:04:56Z) - Non-Stationary Bandits under Recharging Payoffs: Improved Planning with
Sublinear Regret [34.44347218903429]
The multi-armed bandit setting has been recently studied in the non-stationary regime.
The mean payoff of each action is a non-decreasing function of the number of rounds passed since it was last played.
We show how our algorithm can be transformed into a bandit algorithm with sublinear regret.
arXiv Detail & Related papers (2022-05-29T23:55:36Z) - A Robust Phased Elimination Algorithm for Corruption-Tolerant Gaussian
Process Bandits [118.22458816174144]
We propose a novel robust elimination-type algorithm that runs in epochs, combines exploration with infrequent switching to select a small subset of actions, and plays each action for multiple time instants.
Our algorithm, Robust GP Phased Elimination (RGP-PE), successfully balances robustness to corruptions with exploration and exploitation.
We perform the first empirical study of robustness in the corrupted GP bandit setting, and show that our algorithm is robust against a variety of adversarial attacks.
arXiv Detail & Related papers (2022-02-03T21:19:36Z) - Nonstochastic Bandits and Experts with Arm-Dependent Delays [17.272515865592542]
We study nonstochastic bandits and experts in a delayed setting where delays depend on both time and arms.
Our analyses hinge on a novel bound on the drift, measuring how much better an algorithm can perform when given a look-ahead of one round.
arXiv Detail & Related papers (2021-11-02T13:36:11Z) - On Slowly-varying Non-stationary Bandits [25.305949034527202]
We consider dynamic regret in non-stationary bandits with a slowly varying property.
We establish the first instance-dependent regret upper bound for slowly varying non-stationary bandits.
We show that our algorithm is essentially minimax optimal.
arXiv Detail & Related papers (2021-10-25T12:56:19Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Adapting to Delays and Data in Adversarial Multi-Armed Bandits [7.310043452300736]
We analyze variants of the Exp3 algorithm that tune their step-size using only information available at the time of the decisions.
We obtain regret guarantees that adapt to the observed (rather than the worst-case) sequences of delays and/or losses.
arXiv Detail & Related papers (2020-10-12T20:53:52Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)