Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits
- URL: http://arxiv.org/abs/2301.13393v2
- Date: Fri, 2 Jun 2023 09:43:45 GMT
- Title: Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits
- Authors: Yunlong Hou, Vincent Y. F. Tan and Zixin Zhong
- Abstract summary: We propose an algorithm that minimizes the regret over the horizon of time $T$.
The proposed algorithm is applicable to domains such as recommendation systems and transportation.
- Score: 81.60136088841948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motivated by concerns about making online decisions that incur undue amount
of risk at each time step, in this paper, we formulate the probably
anytime-safe stochastic combinatorial semi-bandits problem. In this problem,
the agent is given the option to select a subset of size at most $K$ from a set
of $L$ ground items. Each item is associated to a certain mean reward as well
as a variance that represents its risk. To mitigate the risk that the agent
incurs, we require that with probability at least $1-\delta$, over the entire
horizon of time $T$, each of the choices that the agent makes should contain
items whose sum of variances does not exceed a certain variance budget. We call
this probably anytime-safe constraint. Under this constraint, we design and
analyze an algorithm {\sc PASCombUCB} that minimizes the regret over the
horizon of time $T$. By developing accompanying information-theoretic lower
bounds, we show that under both the problem-dependent and problem-independent
paradigms, {\sc PASCombUCB} is almost asymptotically optimal. Experiments are
conducted to corroborate our theoretical findings. Our problem setup, the
proposed {\sc PASCombUCB} algorithm, and novel analyses are applicable to
domains such as recommendation systems and transportation in which an agent is
allowed to choose multiple items at a single time step and wishes to control
the risk over the whole time horizon.
Related papers
- Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement
Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $tildeO(d sqrtsum_k = 1K sigma_k2 + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z) - Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs [7.125769932993104]
We consider the contextual multi-armed bandit problem for linear payoffs under a risk-averse criterion.
At each round, contexts are revealed for each arm, and the decision maker chooses one arm to pull and receives the corresponding reward.
We apply the Thompson Sampling algorithm for the disjoint model, and provide a comprehensive regret analysis for a variant of the proposed algorithm.
arXiv Detail & Related papers (2022-06-24T18:48:35Z) - Problem Dependent View on Structured Thresholding Bandit Problems [73.70176003598449]
We investigate the problem dependent regime in the Thresholding Bandit problem (TBP)
The objective of the learner is to output, at the end of a sequential game, the set of arms whose means are above a given threshold.
We provide upper and lower bounds for the probability of error in both the concave and monotone settings.
arXiv Detail & Related papers (2021-06-18T15:01:01Z) - Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
Agent sequentially selects actions and, from the resulting system trajectory, aims at the best as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z) - A Multi-Arm Bandit Approach To Subset Selection Under Constraints [14.543433698096667]
We explore the class of problems where a central planner needs to select a subset of agents, each with different quality and cost.
When the agents' quality is known, we formulate our problem as an integer linear program (ILP)
We propose a deterministic algorithm, namely dpss, that provides an exact solution to our ILP.
arXiv Detail & Related papers (2021-02-09T13:48:57Z) - Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed
Rewards [24.983866845065926]
We consider multi-armed bandits with heavy-tailed rewards, whose $p$-th moment is bounded by a constant $nu_p$ for $1pleq2$.
We propose a novel robust estimator which does not require $nu_p$ as prior information.
We show that an error probability of the proposed estimator decays exponentially fast.
arXiv Detail & Related papers (2020-10-24T10:44:02Z) - Stochastic Linear Bandits Robust to Adversarial Attacks [117.665995707568]
We provide two variants of a Robust Phased Elimination algorithm, one that knows $C$ and one that does not.
We show that both variants attain near-optimal regret in the non-corrupted case $C = 0$, while incurring additional additive terms respectively.
In a contextual setting, we show that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing $C$.
arXiv Detail & Related papers (2020-07-07T09:00:57Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB)
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.