Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs
- URL: http://arxiv.org/abs/2206.12463v1
- Date: Fri, 24 Jun 2022 18:48:35 GMT
- Title: Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs
- Authors: Yifan Lin, Yuhao Wang, Enlu Zhou
- Abstract summary: We consider the contextual multi-armed bandit problem for linear payoffs under a risk-averse criterion.
At each round, contexts are revealed for each arm, and the decision maker chooses one arm to pull and receives the corresponding reward.
We apply the Thompson Sampling algorithm for the disjoint model, and provide a comprehensive regret analysis for a variant of the proposed algorithm.
- Score: 7.125769932993104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we consider the contextual multi-armed bandit problem for
linear payoffs under a risk-averse criterion. At each round, contexts are
revealed for each arm, and the decision maker chooses one arm to pull and
receives the corresponding reward. In particular, we consider mean-variance as
the risk criterion, and the best arm is the one with the largest mean-variance
reward. We apply the Thompson Sampling algorithm for the disjoint model, and
provide a comprehensive regret analysis for a variant of the proposed
algorithm. For $T$ rounds, $K$ actions, and $d$-dimensional feature vectors, we
prove a regret bound of $O((1+\rho+\frac{1}{\rho}) d\ln T \ln
\frac{K}{\delta}\sqrt{d K T^{1+2\epsilon} \ln \frac{K}{\delta}
\frac{1}{\epsilon}})$ that holds with probability $1-\delta$ under the
mean-variance criterion with risk tolerance $\rho$, for any
$0<\epsilon<\frac{1}{2}$, $0<\delta<1$. The empirical performance of our
proposed algorithms is demonstrated via a portfolio selection problem.
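Aside (not part of the abstract): one common convention for the mean-variance of arm $a$ under risk tolerance $\rho$ is

$$\mathrm{MV}_\rho(a) = \rho\,\mu_a - \sigma_a^2,$$

where $\mu_a$ and $\sigma_a^2$ are the mean and variance of arm $a$'s reward. Conventions differ across the risk-averse bandit literature, so this exact form is an assumption rather than a quote from the paper.

The sketch below illustrates how Thompson Sampling for a disjoint linear model (a separate parameter vector per arm) might be combined with such a mean-variance score. It is a minimal illustration under stated assumptions (per-arm Bayesian linear regression, a plug-in sample-variance estimate, and the $\rho\mu - \sigma^2$ convention above), not the authors' exact analyzed variant.

```python
import numpy as np

class RiskAverseDisjointTS:
    """Minimal sketch: Thompson Sampling for a disjoint linear model,
    with arms scored by an assumed mean-variance objective
    rho * mean - variance. Illustration only, not the paper's algorithm."""

    def __init__(self, n_arms, dim, rho=1.0, v=1.0):
        self.rho = rho                                   # risk tolerance
        self.v = v                                       # posterior scale parameter
        self.B = [np.eye(dim) for _ in range(n_arms)]    # per-arm precision matrices
        self.f = [np.zeros(dim) for _ in range(n_arms)]  # per-arm sums of reward * context
        self.s = np.zeros(n_arms)                        # sum of rewards per arm
        self.sq = np.zeros(n_arms)                       # sum of squared rewards per arm
        self.n = np.zeros(n_arms)                        # pull counts per arm

    def select(self, contexts, rng=np.random):
        """contexts: array of shape (n_arms, dim), one feature vector per arm."""
        scores = []
        for a, x in enumerate(contexts):
            B_inv = np.linalg.inv(self.B[a])
            mu_hat = B_inv @ self.f[a]                   # posterior mean of theta_a
            theta = rng.multivariate_normal(mu_hat, self.v ** 2 * B_inv)
            mean = float(x @ theta)                      # sampled mean reward
            # Plug-in sample-variance estimate of the arm's reward (assumption).
            var = (self.sq[a] / self.n[a] - (self.s[a] / self.n[a]) ** 2
                   if self.n[a] > 1 else 0.0)
            scores.append(self.rho * mean - var)         # assumed MV convention
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        x = np.asarray(x)
        self.B[arm] += np.outer(x, x)                    # rank-one precision update
        self.f[arm] += reward * x
        self.s[arm] += reward
        self.sq[arm] += reward ** 2
        self.n[arm] += 1
```

At each round, `select` draws one posterior sample per arm and ranks arms by the assumed mean-variance score; the variant analyzed in the paper may differ in details such as posterior inflation and how the variance term is estimated.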
Related papers
- Best-of-Both-Worlds Linear Contextual Bandits [45.378265414553226]
This study investigates the problem of $K$-armed linear contextual bandits, an instance of the multi-armed bandit problem, under adversarial corruption.
We develop a strategy that is effective in both stochastic and adversarial environments, with theoretical guarantees.
We refer to our strategy as the Best-of-Both-Worlds (BoBW) RealFTRL, due to its theoretical guarantees in both regimes.
arXiv Detail & Related papers (2023-12-27T09:32:18Z)
- Contextual Combinatorial Bandits with Probabilistically Triggered Arms [55.9237004478033]
We study contextual combinatorial bandits with probabilistically triggered arms (C$^2$MAB-T) under a variety of smoothness conditions.
Under the triggering probability modulated (TPM) condition, we devise the C$^2$-UCB-T algorithm and derive a regret bound of $\tilde{O}(d\sqrt{T})$.
arXiv Detail & Related papers (2023-03-30T02:51:00Z) - Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits [81.60136088841948]
We propose an algorithm that minimizes the regret over the time horizon $T$.
The proposed algorithm is applicable to domains such as recommendation systems and transportation.
arXiv Detail & Related papers (2023-01-31T03:49:00Z)
- Double Doubly Robust Thompson Sampling for Generalized Linear Contextual Bandits [8.508198765617198]
We propose a novel contextual bandit algorithm for generalized linear rewards with an $\tilde{O}(\sqrt{\kappa^{-1} \phi T})$ regret over $T$ rounds.
We also provide an $O(\kappa^{-1} \phi \log (NT) \log T)$ regret bound for $N$ arms under a probabilistic margin condition.
arXiv Detail & Related papers (2022-09-15T00:20:38Z)
- Linear Contextual Bandits with Adversarial Corruptions [91.38793800392108]
We study the linear contextual bandit problem in the presence of adversarial corruption.
We present a variance-aware algorithm that is adaptive to the level of adversarial contamination $C$.
arXiv Detail & Related papers (2021-10-25T02:53:24Z)
- Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov decision processes (MDPs).
For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^{K}\sigma_i^2})$ regret bound, where $d$ is the feature dimension.
For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{K})$ regret bound, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-29T18:57:52Z)
- Stochastic Linear Bandits Robust to Adversarial Attacks [117.665995707568]
We provide two variants of a Robust Phased Elimination algorithm, one that knows the corruption budget $C$ and one that does not.
We show that both variants attain near-optimal regret in the non-corrupted case $C = 0$, while incurring additional corruption-dependent additive terms.
In a contextual setting, we show that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing $C$.
arXiv Detail & Related papers (2020-07-07T09:00:57Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies that maximizes expected cumulative reward while keeping each policy's expected cost below a given threshold.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)