Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits
- URL: http://arxiv.org/abs/2301.13393v2
- Date: Fri, 2 Jun 2023 09:43:45 GMT
- Title: Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits
- Authors: Yunlong Hou, Vincent Y. F. Tan and Zixin Zhong
- Abstract summary: We propose an algorithm, {\sc PASCombUCB}, that minimizes the regret over the horizon of time $T$ while satisfying a probably anytime-safe constraint on the variances of the chosen items.
The proposed algorithm is applicable to domains such as recommendation systems and transportation.
- Score: 81.60136088841948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motivated by concerns about making online decisions that incur an undue amount
of risk at each time step, in this paper, we formulate the probably
anytime-safe stochastic combinatorial semi-bandits problem. In this problem,
the agent is given the option to select a subset of size at most $K$ from a set
of $L$ ground items. Each item is associated with a certain mean reward as well
as a variance that represents its risk. To mitigate the risk that the agent
incurs, we require that with probability at least $1-\delta$, over the entire
horizon of time $T$, each of the choices that the agent makes should contain
items whose sum of variances does not exceed a certain variance budget. We call
this the probably anytime-safe constraint. Under this constraint, we design and
analyze an algorithm {\sc PASCombUCB} that minimizes the regret over the
horizon of time $T$. By developing accompanying information-theoretic lower
bounds, we show that under both the problem-dependent and problem-independent
paradigms, {\sc PASCombUCB} is almost asymptotically optimal. Experiments are
conducted to corroborate our theoretical findings. Our problem setup, the
proposed {\sc PASCombUCB} algorithm, and novel analyses are applicable to
domains such as recommendation systems and transportation in which an agent is
allowed to choose multiple items at a single time step and wishes to control
the risk over the whole time horizon.
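In symbols, the probably anytime-safe constraint requires $\Pr\big(\forall\, t \le T:\ \sum_{i \in S_t} \sigma_i^2 \le \mathcal{B}\big) \ge 1-\delta$, where $S_t$ is the subset chosen at time $t$ and $\mathcal{B}$ is the variance budget (notation ours). The sketch below is not the authors' {\sc PASCombUCB}; it is a minimal, hypothetical UCB-style loop showing how a semi-bandit agent might greedily build a subset of at most $K$ items whose estimated total variance stays within the budget. The Gaussian reward model, the confidence bonuses, and all function names are illustrative assumptions.

```python
import numpy as np

def greedy_safe_subset(mu_ucb, var_est, K, budget):
    """Greedily add items in decreasing order of reward UCB while the
    estimated total variance of the chosen subset stays within the
    budget. A hypothetical oracle, not the paper's exact procedure."""
    chosen, var_sum = [], 0.0
    for i in np.argsort(-mu_ucb):
        if len(chosen) < K and var_sum + var_est[i] <= budget:
            chosen.append(int(i))
            var_sum += var_est[i]
    return chosen

def run_ucb_semi_bandit(means, variances, K, budget, T, seed=0):
    """Toy UCB semi-bandit loop: pick a subset, observe one sample per
    chosen item (semi-bandit feedback), update per-item mean/variance
    estimates, and repeat for T rounds."""
    rng = np.random.default_rng(seed)
    L = len(means)
    n = np.zeros(L)          # number of times each item was played
    mean_hat = np.zeros(L)   # empirical mean reward per item
    m2 = np.zeros(L)         # running sum of squared deviations (Welford)
    total_reward = 0.0
    for t in range(1, T + 1):
        bonus = np.sqrt(2.0 * np.log(t + 1) / np.maximum(n, 1.0))
        mu_ucb = np.where(n > 0, mean_hat + bonus, np.inf)      # force exploration of unseen items
        var_hat = np.where(n > 1, m2 / np.maximum(n - 1, 1.0), 0.0)
        var_est = np.maximum(var_hat - bonus, 0.0)              # optimistic (low) variance estimate
        subset = greedy_safe_subset(mu_ucb, var_est, K, budget)
        for i in subset:                                        # semi-bandit feedback
            x = rng.normal(means[i], np.sqrt(variances[i]))
            n[i] += 1
            delta = x - mean_hat[i]
            mean_hat[i] += delta / n[i]
            m2[i] += delta * (x - mean_hat[i])
            total_reward += x
    return total_reward
```

Because the feasibility check uses estimated rather than true variances, the budget can still be violated while the estimates are loose, which is why the guarantee in the paper is probabilistic, holding with probability at least $1-\delta$ over the whole horizon.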
Related papers
- Cooperative Multi-Agent Constrained Stochastic Linear Bandits [2.099922236065961]
A network of $N$ agents communicate locally to minimize their collective regret while keeping their expected cost under a specified threshold $\tau$.
We propose a safe distributed upper confidence bound algorithm, called MA-OPLB, and establish a high-probability bound on its $T$-round regret.
We show that our regret bound is of order $\mathcal{O}\left(\frac{d}{\tau-c_0}\frac{\log(NT)^2}{\sqrt{N}}\sqrt{T\log(1/|\lambda|)}\right)$.
arXiv Detail & Related papers (2024-10-22T19:34:53Z) - Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}\big(d\sqrt{\sum_{k=1}^{K}\sigma_k^2} + d\big)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z) - Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs [7.125769932993104]
We consider the contextual multi-armed bandit problem for linear payoffs under a risk-averse criterion.
At each round, contexts are revealed for each arm, and the decision maker chooses one arm to pull and receives the corresponding reward.
We apply the Thompson Sampling algorithm to the disjoint model and provide a comprehensive regret analysis for a variant of the proposed algorithm; a minimal illustrative sketch of disjoint-model Thompson Sampling is given after this list of related papers.
arXiv Detail & Related papers (2022-06-24T18:48:35Z) - Bandits with Stochastic Experts: Constant Regret, Empirical Experts and Episodes [36.104981594178525]
We study a variant of the contextual bandit problem where an agent can intervene through a set of expert policies.
We propose the Divergence-based Upper Confidence Bound (D-UCB) algorithm that uses importance sampling to share information across experts.
We also provide the Empirical D-UCB (ED-UCB) algorithm that can function with only approximate knowledge of expert distributions.
arXiv Detail & Related papers (2021-07-07T14:58:14Z) - Problem Dependent View on Structured Thresholding Bandit Problems [73.70176003598449]
We investigate the problem-dependent regime in the Thresholding Bandit problem (TBP).
The objective of the learner is to output, at the end of a sequential game, the set of arms whose means are above a given threshold.
We provide upper and lower bounds for the probability of error in both the concave and monotone settings.
arXiv Detail & Related papers (2021-06-18T15:01:01Z) - Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims to identify the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z) - A Multi-Arm Bandit Approach To Subset Selection Under Constraints [14.543433698096667]
We explore the class of problems where a central planner needs to select a subset of agents, each with different quality and cost.
When the agents' quality is known, we formulate our problem as an integer linear program (ILP).
We propose a deterministic algorithm, namely DPSS, that provides an exact solution to our ILP.
arXiv Detail & Related papers (2021-02-09T13:48:57Z) - Stochastic Linear Bandits Robust to Adversarial Attacks [117.665995707568]
We provide two variants of a Robust Phased Elimination algorithm, one that knows the corruption budget $C$ and one that does not.
We show that both variants attain near-optimal regret in the non-corrupted case $C = 0$, while incurring additional additive terms that depend on $C$.
In a contextual setting, we show that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing $C$.
arXiv Detail & Related papers (2020-07-07T09:00:57Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic-pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
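As a point of reference for the risk-averse contextual bandit entry above, here is a minimal sketch of standard disjoint-model linear Thompson Sampling, in which each arm keeps its own ridge-regression estimate and an action is chosen by sampling one parameter vector per arm. It deliberately omits that paper's risk-averse criterion; the function names, the posterior-scale parameter `v`, and the Gaussian posterior are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def disjoint_lin_ts(contexts_fn, reward_fn, n_arms, d, T, v=0.5, lam=1.0, seed=0):
    """Disjoint-model linear Thompson Sampling: one independent linear
    model per arm. contexts_fn(t) returns an (n_arms, d) context array;
    reward_fn(t, arm) returns the observed reward for the pulled arm."""
    rng = np.random.default_rng(seed)
    B = np.stack([lam * np.eye(d) for _ in range(n_arms)])  # per-arm regularized design matrices
    f = np.zeros((n_arms, d))                               # per-arm sums of reward-weighted contexts
    rewards = []
    for t in range(T):
        X = contexts_fn(t)
        scores = np.empty(n_arms)
        for a in range(n_arms):
            cov = np.linalg.inv(B[a])
            theta_hat = cov @ f[a]
            theta_tilde = rng.multivariate_normal(theta_hat, v ** 2 * cov)  # posterior sample
            scores[a] = float(X[a] @ theta_tilde)
        arm = int(np.argmax(scores))                         # pull the arm with the highest sampled payoff
        r = reward_fn(t, arm)
        B[arm] += np.outer(X[arm], X[arm])
        f[arm] += r * X[arm]
        rewards.append(r)
    return rewards
```

A risk-averse variant in the spirit of that paper would score arms by a risk-adjusted criterion rather than the raw sampled payoff; that modification is specific to the paper and is not reproduced here.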