Best-of-Both-Worlds Linear Contextual Bandits
- URL: http://arxiv.org/abs/2312.16489v1
- Date: Wed, 27 Dec 2023 09:32:18 GMT
- Title: Best-of-Both-Worlds Linear Contextual Bandits
- Authors: Masahiro Kato and Shinji Ito
- Abstract summary: This study investigates the problem of $K$-armed linear contextual bandits, an instance of the multi-armed bandit problem, under adversarial corruption.
We develop a strategy that is effective in both stochastic and adversarial environments, with theoretical guarantees.
We refer to our strategy as the Best-of-Both-Worlds (BoBW) RealFTRL, due to its theoretical guarantees in both regimes.
- Score: 45.378265414553226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study investigates the problem of $K$-armed linear contextual bandits,
an instance of the multi-armed bandit problem, under adversarial corruption.
At each round, a decision-maker observes an independent and identically
distributed context and then selects an arm based on the context and past
observations. After selecting an arm, the decision-maker incurs a loss
corresponding to the selected arm. The decision-maker aims to minimize the
cumulative loss over the trial. The goal of this study is to develop a strategy
that is effective in both stochastic and adversarial environments, with
theoretical guarantees. We first formulate the problem by introducing a novel
setting of bandits with adversarial corruption, referred to as the contextual
adversarial regime with a self-bounding constraint. We assume linear models for
the relationship between the loss and the context. Then, we propose a strategy
that extends the RealLinExp3 by Neu & Olkhovskaya (2020) and the
Follow-The-Regularized-Leader (FTRL). The regret of our proposed algorithm is
shown to be upper-bounded by $O\left(\min\left\{\frac{(\log(T))^3}{\Delta_{*}}
+ \sqrt{\frac{C(\log(T))^3}{\Delta_{*}}},\ \
\sqrt{T}(\log(T))^2\right\}\right)$, where $T \in\mathbb{N}$ is the number of
rounds, $\Delta_{*} > 0$ is the constant minimum gap between the best and
suboptimal arms for any context, and $C\in[0, T] $ is an adversarial corruption
parameter. This regret upper bound implies
$O\left(\frac{(\log(T))^3}{\Delta_{*}}\right)$ in a stochastic environment and
$O\left( \sqrt{T}(\log(T))^2\right)$ in an adversarial environment. We refer
to our strategy as the Best-of-Both-Worlds (BoBW) RealFTRL, due to its
theoretical guarantees in both stochastic and adversarial regimes.
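As a rough illustration of the FTRL-based arm-selection idea described in the abstract, the sketch below runs exponential weights (FTRL with a negative-entropy regularizer) over importance-weighted linear loss estimates, in the spirit of RealLinExp3. The environment, the learning rate eta, the exploration rate gamma, the ridge term, and the plug-in covariance estimator are illustrative assumptions; the paper's BoBW RealFTRL relies on its own loss estimator and tuning, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 4, 3, 5000                  # arms, context dimension, rounds
eta, gamma, ridge = 0.05, 0.05, 1e-3  # learning rate, exploration rate, ridge (assumed values)

theta_true = rng.normal(size=(K, d)) / np.sqrt(d)  # unknown per-arm loss parameters
theta_cum = np.zeros((K, d))                       # cumulative loss-parameter estimates
cov_sum = np.zeros((K, d, d))                      # running sums approximating E[p_a(X) X X^T]
total_loss = 0.0

for t in range(1, T + 1):
    x = rng.normal(size=d) / np.sqrt(d)            # i.i.d. context

    # FTRL with a negative-entropy regularizer = exponential weights over the
    # cumulative estimated losses <x, theta_cum[a]>, mixed with uniform exploration.
    scores = theta_cum @ x
    w = np.exp(-eta * (scores - scores.min()))     # shift for numerical stability
    p = (1.0 - gamma) * w / w.sum() + gamma / K
    p /= p.sum()

    a = rng.choice(K, p=p)
    loss = float(np.clip(theta_true[a] @ x + 0.1 * rng.normal(), 0.0, 1.0))
    total_loss += loss

    # Plug-in covariance estimates and an importance-weighted estimate of the pulled
    # arm's loss parameter (a simplified stand-in for the paper's estimator).
    for k in range(K):
        cov_sum[k] += p[k] * np.outer(x, x)
    sigma_a = cov_sum[a] / t + ridge * np.eye(d)
    theta_cum[a] += np.linalg.solve(sigma_a, x) * loss

print("average loss:", total_loss / T)
```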
Related papers
- LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits [38.41164102066483]
This study considers the linear contextual bandit problem with independent and identically distributed contexts.
Our proposed algorithm is based on the Follow-The-Regularized-Leader with the Tsallis entropy and is referred to as the $\alpha$-Linear-Contextual (LC)-Tsallis-INF.
arXiv Detail & Related papers (2024-03-05T18:59:47Z) - Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits.
Our algorithms deliver near-optimal regret bounds in both the stochastic and adversarial regimes.
arXiv Detail & Related papers (2023-12-24T08:27:30Z) - Federated Linear Bandits with Finite Adversarial Actions [20.1041278044797]
We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem.
To address the unique challenges of adversarial finite action sets, we propose the FedSupLinUCB algorithm.
We prove that FedSupLinUCB achieves a total regret of $\tilde{O}(\sqrt{dT})$, where $T$ is the total number of arm pulls from all clients, and $d$ is the ambient dimension of the linear model.
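The summary above only states the regret bound; as background, the sketch below illustrates the generic sufficient-statistic sharing idea behind federated linear bandits, where each client keeps local ridge-regression statistics and periodically merges them through a central server. This is not the FedSupLinUCB protocol itself (which builds on SupLinUCB's phased structure and a specific communication schedule); the sync period, the confidence width beta, and the environment are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, M, K, T = 4, 3, 6, 2000               # dimension, clients, actions per round, rounds
sync_every, beta, noise = 50, 1.0, 0.1   # communication period, confidence width, noise (assumed)
theta = rng.normal(size=d) / np.sqrt(d)  # unknown shared reward parameter

A_global, b_global = np.eye(d), np.zeros(d)     # server-side ridge statistics
A_local = [np.zeros((d, d)) for _ in range(M)]  # per-client deltas since last sync
b_local = [np.zeros(d) for _ in range(M)]

for t in range(1, T + 1):
    for m in range(M):
        actions = rng.normal(size=(K, d)) / np.sqrt(d)  # client m's finite action set this round
        A = A_global + A_local[m]
        b = b_global + b_local[m]
        theta_hat = np.linalg.solve(A, b)
        A_inv = np.linalg.inv(A)
        # LinUCB-style optimistic scores computed from the locally available statistics.
        width = np.sqrt(np.einsum("ki,ij,kj->k", actions, A_inv, actions))
        x = actions[(actions @ theta_hat + beta * width).argmax()]
        reward = x @ theta + noise * rng.normal()
        A_local[m] += np.outer(x, x)
        b_local[m] += reward * x
    if t % sync_every == 0:  # all clients upload their deltas; server broadcasts the merge
        for m in range(M):
            A_global += A_local[m]
            b_global += b_local[m]
            A_local[m] = np.zeros((d, d))
            b_local[m] = np.zeros(d)

print("parameter estimation error:", np.linalg.norm(np.linalg.solve(A_global, b_global) - theta))
```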
arXiv Detail & Related papers (2023-11-02T03:41:58Z) - Corruption-Robust Offline Reinforcement Learning with General Function
Approximation [60.91257031278004]
We investigate the problem of corruption in offline reinforcement learning (RL) with general function approximation.
Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision processes (MDPs)
arXiv Detail & Related papers (2023-10-23T04:07:26Z) - Variance-Aware Sparse Linear Bandits [64.70681598741417]
Worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}\left(\sqrt{dT}\right)$.
In the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal{O}}(1)$ regret.
We develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits.
arXiv Detail & Related papers (2022-05-26T15:55:44Z) - Linear Contextual Bandits with Adversarial Corruptions [91.38793800392108]
We study the linear contextual bandit problem in the presence of adversarial corruption.
We present a variance-aware algorithm that is adaptive to the level of adversarial contamination $C$.
arXiv Detail & Related papers (2021-10-25T02:53:24Z) - An Algorithm for Stochastic and Adversarial Bandits with Switching Costs [10.549307055348596]
We propose an algorithm for adversarial multi-armed bandits with switching costs, where the algorithm pays a price $\lambda$ every time it switches the arm being played.
Our algorithm is based on an adaptation of the Tsallis-INF algorithm of Zimmert and Seldin (2021) and requires no prior knowledge of the regime or time horizon.
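For reference, the sketch below implements the sampling step of the Tsallis-INF algorithm (with power 1/2) that the switching-cost method above adapts: the arm distribution solves p_i = 4 / (eta_t (L_i - x))^2 with the normalizer x found by bisection, and losses are fed back through importance weighting. The learning-rate schedule, the bisection bracket, and the Bernoulli environment are illustrative assumptions; the switching-cost modification itself is not shown.

```python
import numpy as np

def tsallis_inf_probs(cum_loss_est, eta):
    """Arm distribution p_i = 4 / (eta * (L_i - x))^2, with the normalizer x
    chosen by bisection so that the probabilities sum to one."""
    K = len(cum_loss_est)
    L_min = cum_loss_est.min()
    lo = L_min - 2.0 * np.sqrt(K) / eta  # here the probabilities sum to <= 1
    hi = L_min - 2.0 / eta               # here the probabilities sum to >= 1
    for _ in range(60):
        x = 0.5 * (lo + hi)
        p = 4.0 / (eta * (cum_loss_est - x)) ** 2
        if p.sum() > 1.0:
            hi = x
        else:
            lo = x
    return p / p.sum()

rng = np.random.default_rng(0)
K, T = 5, 10_000
means = rng.uniform(0.2, 0.8, size=K)  # hypothetical Bernoulli loss means
cum_loss_est = np.zeros(K)
pulls = np.zeros(K, dtype=int)

for t in range(1, T + 1):
    eta = 2.0 / np.sqrt(t)             # anytime learning-rate schedule (assumed)
    p = tsallis_inf_probs(cum_loss_est, eta)
    a = rng.choice(K, p=p)
    pulls[a] += 1
    loss = float(rng.random() < means[a])
    cum_loss_est[a] += loss / p[a]     # importance-weighted loss estimate

print("pull counts:", pulls)
print("true best arm:", means.argmin())
```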
arXiv Detail & Related papers (2021-02-19T11:03:51Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits
with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB)
arXiv Detail & Related papers (2020-06-17T22:32:19Z)