Sparsity-Agnostic Linear Bandits with Adaptive Adversaries
- URL: http://arxiv.org/abs/2406.01192v1
- Date: Mon, 3 Jun 2024 10:54:58 GMT
- Title: Sparsity-Agnostic Linear Bandits with Adaptive Adversaries
- Authors: Tianyuan Jin, Kyoungseok Jang, Nicolò Cesa-Bianchi
- Abstract summary: We study linear bandits where, in each round, the learner receives a set of actions (i.e., feature vectors) from which it chooses an element and obtains a reward.
The expected reward is a fixed but unknown linear function of the chosen action.
We study sparse regret bounds that depend on the number $S$ of non-zero coefficients in the linear reward function.
- Score: 19.84322270472381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study stochastic linear bandits where, in each round, the learner receives a set of actions (i.e., feature vectors), from which it chooses an element and obtains a stochastic reward. The expected reward is a fixed but unknown linear function of the chosen action. We study sparse regret bounds that depend on the number $S$ of non-zero coefficients in the linear reward function. Previous works focused on the case where $S$ is known or the action sets satisfy additional assumptions. In this work, we obtain the first sparse regret bounds that hold when $S$ is unknown and the action sets are adversarially generated. Our techniques combine online-to-confidence-set conversions with a novel randomized model selection approach over a hierarchy of nested confidence sets. When $S$ is known, our analysis recovers state-of-the-art bounds for adversarial action sets. We also show that a variant of our approach, using Exp3 to dynamically select the confidence sets, can be used to improve the empirical performance of stochastic linear bandits while enjoying a regret bound with optimal dependence on the time horizon.
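The Exp3-based variant mentioned above can be illustrated with a minimal sketch, assuming a small set of base linear bandit learners (e.g., one confidence-set learner per sparsity guess) that expose hypothetical select/update methods and rewards scaled to [0, 1]; the names, interface, and learning rate below are illustrative assumptions, not the authors' implementation.
```python
import numpy as np

def exp3_over_confidence_sets(base_learners, action_sets, reward_fn, gamma=0.1):
    """Sketch: Exp3 choosing among nested confidence-set learners each round.

    base_learners : list of objects with select(action_set) -> action and
                    update(action, reward) methods (hypothetical interface).
    action_sets   : iterable of per-round action sets (possibly adversarial).
    reward_fn     : callable(action) -> observed reward, assumed in [0, 1].
    """
    m = len(base_learners)
    log_weights = np.zeros(m)          # Exp3 weights kept in log-space
    total_reward = 0.0
    for actions in action_sets:
        w = np.exp(log_weights - log_weights.max())
        probs = (1 - gamma) * w / w.sum() + gamma / m   # mix in uniform exploration
        i = np.random.choice(m, p=probs)                # pick a confidence-set level
        action = base_learners[i].select(actions)       # that learner picks the action
        reward = reward_fn(action)
        total_reward += reward
        # importance-weighted estimate: only the chosen learner's weight moves
        log_weights[i] += gamma * (reward / probs[i]) / m
        base_learners[i].update(action, reward)
    return total_reward
```
In the paper's stochastic setting the base learners would come from online-to-confidence-set conversions at different sparsity levels; here they are left abstract.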
Related papers
- Restless Bandit Problem with Rewards Generated by a Linear Gaussian Dynamical System [0.0]
Decision-making under uncertainty is a fundamental, frequently encountered problem that can be formulated as a multi-armed bandit problem.
We propose a method that predicts each action's next reward as a linear combination of previously observed rewards.
We show that, regardless of the sequence of previously chosen actions, the reward sampled for any previously chosen action can be used to predict another action's future reward.
arXiv Detail & Related papers (2024-05-15T05:33:49Z)
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z)
- Linear Bandits with Memory: from Rotting to Rising [5.5969337839476765]
Nonstationary phenomena, such as satiation effects in recommendations, have mostly been modeled using bandits with finitely many arms.
We introduce a novel nonstationary linear bandit model, where current rewards are influenced by the learner's past actions in a fixed-size window.
arXiv Detail & Related papers (2023-02-16T15:02:07Z)
- PopArt: Efficient Sparse Regression and Experimental Design for Optimal Sparse Linear Bandits [29.097522376094624]
We propose a simple and computationally efficient sparse linear estimation method called PopArt (a generic hard-thresholding sketch follows this entry).
We derive sparse linear bandit algorithms whose regret upper bounds improve on the state of the art.
arXiv Detail & Related papers (2022-10-25T19:13:20Z)
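PopArt itself is not reproduced here; purely as a hypothetical illustration of hard-thresholding-style sparse linear estimation (ridge regression followed by zeroing small coefficients), a minimal sketch is given below, with the threshold and synthetic data chosen arbitrarily.
```python
import numpy as np

def hard_thresholded_ridge(X, y, lam=1.0, tau=0.1):
    """Illustrative sparse linear estimator (not PopArt): ridge fit, then
    zero out coefficients whose magnitude falls below the threshold tau."""
    d = X.shape[1]
    theta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # ridge estimate
    theta[np.abs(theta) < tau] = 0.0                             # hard thresholding
    return theta

# Usage sketch with synthetic data (assumed Gaussian design, 2-sparse truth).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
theta_true = np.zeros(20)
theta_true[[3, 7]] = [1.0, -0.5]
y = X @ theta_true + 0.1 * rng.standard_normal(200)
print(hard_thresholded_ridge(X, y, tau=0.2))
```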
- A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits [11.918230810566945]
We study the non-stationary multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning.
We propose a method that achieves, in $K$-armed bandit problems, a near-optimal $\widetilde{O}(\sqrt{K N (S+1)})$ dynamic regret.
arXiv Detail & Related papers (2022-01-17T17:23:56Z)
- Top $K$ Ranking for Multi-Armed Bandit with Noisy Evaluations [102.32996053572144]
We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy independent evaluations of the true reward of each arm.
We derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated.
arXiv Detail & Related papers (2021-12-13T09:48:54Z)
- Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus (the exact bonus is sketched after this entry).
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z)
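For context, the elliptical bonus that the paper approximates is the standard linear bandit exploration bonus $\beta \sqrt{x^\top V^{-1} x}$ with $V = \lambda I + \sum_s x_s x_s^\top$; the sketch below computes the exact bonus (the anti-concentration approximation itself is not shown, and the names and constants are illustrative assumptions).
```python
import numpy as np

def elliptical_bonus(x, history, lam=1.0, beta=1.0):
    """Exact elliptical exploration bonus beta * sqrt(x^T V^{-1} x),
    where V = lam * I + sum of outer products of past feature vectors.
    This is the quantity the paper approximates more cheaply."""
    d = x.shape[0]
    V = lam * np.eye(d) + sum(np.outer(s, s) for s in history)
    return beta * np.sqrt(x @ np.linalg.solve(V, x))
```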
- Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z)
- Stochastic Linear Bandits Robust to Adversarial Attacks [117.665995707568]
We provide two variants of a Robust Phased Elimination algorithm, one that knows the corruption budget $C$ and one that does not.
We show that both variants attain near-optimal regret in the non-corrupted case $C = 0$, while each incurs an additional additive term when $C > 0$.
In a contextual setting, we show that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing $C$.
arXiv Detail & Related papers (2020-07-07T09:00:57Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies that maximize cumulative reward while keeping each policy's expected cost below a given threshold.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)