Contextual Combinatorial Volatile Bandits via Gaussian Processes
- URL: http://arxiv.org/abs/2110.02248v1
- Date: Tue, 5 Oct 2021 18:02:10 GMT
- Title: Contextual Combinatorial Volatile Bandits via Gaussian Processes
- Authors: Andi Nika, Sepehr Elahi, Cem Tekin
- Abstract summary: We consider a contextual bandit problem with a set of available base arms and their contexts.
We propose an algorithm called Optimistic Combinatorial Learning and Optimization with Kernel Upper Confidence Bounds (O'CLOK-UCB).
We experimentally show that both algorithms vastly outperform the previous state-of-the-art UCB-based algorithms in realistic setups.
- Score: 10.312968200748116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a contextual bandit problem with a combinatorial action set and
time-varying base arm availability. At the beginning of each round, the agent
observes the set of available base arms and their contexts and then selects an
action that is a feasible subset of the set of available base arms to maximize
its cumulative reward in the long run. We assume that the mean outcomes of base
arms are samples from a Gaussian Process indexed by the context set ${\cal X}$,
and the expected reward is Lipschitz continuous in expected base arm outcomes.
For this setup, we propose an algorithm called Optimistic Combinatorial
Learning and Optimization with Kernel Upper Confidence Bounds (O'CLOK-UCB) and
prove that it incurs $\tilde{O}(K\sqrt{T\overline{\gamma}_{T}} )$ regret with
high probability, where $\overline{\gamma}_{T}$ is the maximum information gain
associated with the set of base arm contexts that appeared in the first $T$
rounds and $K$ is the maximum cardinality of any feasible action over all
rounds. To dramatically speed up the algorithm, we also propose a variant of
O'CLOK-UCB that uses sparse GPs. Finally, we experimentally show that both
algorithms exploit inter-base arm outcome correlation and vastly outperform the
previous state-of-the-art UCB-based algorithms in realistic setups.
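
To make the optimistic selection rule concrete, below is a minimal Python sketch of one round of GP-based kernel-UCB combinatorial selection in the spirit of O'CLOK-UCB. The squared-exponential kernel, the fixed exploration parameter `beta`, the 2-dimensional contexts, and the top-$K$ oracle are illustrative assumptions, not the paper's exact construction.

```python
# A minimal, illustrative sketch of a GP-UCB-style combinatorial bandit round,
# in the spirit of O'CLOK-UCB. Kernel choice, beta schedule, context dimension,
# and the top-K oracle are simplifying assumptions, not the authors' algorithm.
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between context sets A (n, d) and B (m, d)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * lengthscale ** 2))

class OClokUcbSketch:
    def __init__(self, noise_var=0.1, beta=2.0, K=5):
        self.noise_var, self.beta, self.K = noise_var, beta, K
        self.X = np.empty((0, 2))   # contexts of previously played base arms (d=2 assumed)
        self.y = np.empty(0)        # their observed semi-bandit outcomes

    def posterior(self, X_avail):
        """GP posterior mean/std at the contexts of the currently available base arms."""
        if len(self.y) == 0:
            return np.zeros(len(X_avail)), np.ones(len(X_avail))
        K_tt = rbf_kernel(self.X, self.X) + self.noise_var * np.eye(len(self.y))
        K_st = rbf_kernel(X_avail, self.X)
        L = np.linalg.cholesky(K_tt)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, self.y))
        mu = K_st @ alpha
        v = np.linalg.solve(L, K_st.T)
        var = np.clip(1.0 - (v ** 2).sum(0), 1e-12, None)  # prior k(x, x) = 1
        return mu, np.sqrt(var)

    def select_action(self, X_avail):
        """Optimism: rank available base arms by their UCB index and take the top K.
        (A stand-in for a general approximation oracle over feasible subsets.)"""
        mu, sigma = self.posterior(X_avail)
        ucb = mu + np.sqrt(self.beta) * sigma
        return np.argsort(ucb)[::-1][: self.K]

    def update(self, X_played, outcomes):
        """Fold the semi-bandit feedback of the played base arms into the GP."""
        self.X = np.vstack([self.X, X_played])
        self.y = np.concatenate([self.y, outcomes])
```

The sparse-GP variant mentioned in the abstract would, in this sketch, replace the exact Cholesky posterior in `posterior` with an inducing-point approximation, reducing the per-round cost from cubic to roughly linear in the number of observed outcomes.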
Related papers
- Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for combinatorial multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z) - Combinatorial Bandits for Maximum Value Reward Function under Max
Value-Index Feedback [9.771002043127728]
We consider a multi-armed bandit problem for maximum value reward function under maximum value and index feedback.
We propose an algorithm and provide a regret bound for problem instances with arm outcomes according to arbitrary distributions with finite supports.
Our algorithm achieves a $O((k/\Delta)\log(T))$ distribution-dependent and a $\tilde{O}(\sqrt{T})$ distribution-independent regret.
arXiv Detail & Related papers (2023-05-25T14:02:12Z) - Contextual Combinatorial Bandits with Probabilistically Triggered Arms [45.305256056479045]
We study contextual combinatorial bandits with probabilistically triggered arms (C$^2$MAB-T) under a variety of smoothness conditions.
Under the triggering probability modulated (TPM) condition, we devise the C$^2$-UCB-T algorithm and derive a regret bound $\tilde{O}(d\sqrt{T})$.
arXiv Detail & Related papers (2023-03-30T02:51:00Z) - Batch-Size Independent Regret Bounds for Combinatorial Semi-Bandits with
Probabilistically Triggered Arms or Independent Arms [53.89752069984382]
We study combinatorial semi-bandits (CMAB) and focus on reducing the dependency on the batch size $K$ in the regret bound.
First, for the setting of CMAB with probabilistically triggered arms (CMAB-T), we propose a BCUCB-T algorithm with variance-aware confidence intervals.
Second, for the setting of non-triggering CMAB with independent arms, we propose an SESCB algorithm that leverages the non-triggering version of the TPVM condition.
arXiv Detail & Related papers (2022-08-31T13:09:39Z) - Contextual Combinatorial Multi-output GP Bandits with Group Constraints [11.317136648551537]
In federated multi-armed bandit problems, maximizing global reward while satisfying minimum privacy requirements to protect clients is the main goal.
We consider a contextual bandit setting with groups and changing action sets, where similar base arms arrive in groups and a set of base arms, called a super arm, must be chosen in each round to maximize super arm reward while satisfying the constraints of the rewards of groups from which base arms were chosen.
We then propose a novel double-UCB GP-bandit algorithm, called Thresholded Combinatorial Upper Confidence Bounds (TCGP-UCB), which balances between maximizing cumulative super arm reward and satisfying group reward constraints.
arXiv Detail & Related papers (2021-11-29T18:39:09Z) - Sleeping Combinatorial Bandits [15.004764451770441]
We adapt the well-known CUCB algorithm in the sleeping bandits setting and refer to it as CSUCB.
We prove -- under mild conditions -- that the CSUCB algorithm achieves an $O(\sqrt{T \log(T)})$ instance-dependent regret guarantee.
Our results are quite general and hold under general environments -- such as non-additive reward functions, volatile arm availability, a variable number of base-arms to be pulled -- arising in practical applications.
arXiv Detail & Related papers (2021-06-03T06:49:44Z) - Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy.
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log(|\mathcal{F}|T)})$.
arXiv Detail & Related papers (2021-02-15T19:10:52Z) - Efficient Pure Exploration for Combinatorial Bandits with Semi-Bandit
Feedback [51.21673420940346]
Combinatorial bandits generalize multi-armed bandits, where the agent chooses sets of arms and observes a noisy reward for each arm contained in the chosen set.
We focus on the pure-exploration problem of identifying the best arm with fixed confidence, as well as a more general setting, where the structure of the answer set differs from the one of the action set.
Based on a projection-free online learning algorithm for finite polytopes, it is the first computationally efficient algorithm which is asymptotically optimal and has competitive empirical performance.
arXiv Detail & Related papers (2021-01-21T10:35:09Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits
with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)