The Influence of Shape Constraints on the Thresholding Bandit Problem
- URL: http://arxiv.org/abs/2006.10006v3
- Date: Tue, 23 Feb 2021 09:27:50 GMT
- Title: The Influence of Shape Constraints on the Thresholding Bandit Problem
- Authors: James Cheshire, Pierre Menard, Alexandra Carpentier
- Abstract summary: We investigate the Thresholding Bandit problem (TBP) under several shape constraints.
In the TBP problem the aim is to output, at the end of the sequential game, the set of arms whose means are above a given threshold.
We prove that the minimax rates for the regret are (i) $\sqrt{\log(K)K/T}$ for TBP, (ii) $\sqrt{\log(K)/T}$ for MTBP, (iii) $\sqrt{K/T}$ for UTBP and (iv) $\sqrt{\log\log K/T}$ for CTBP.
- Score: 74.73898216415049
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the stochastic Thresholding Bandit problem (TBP) under several
shape constraints. On top of (i) the vanilla, unstructured TBP, we consider the
case where (ii) the sequence of arm means $(\mu_k)_k$ is monotonically
increasing (MTBP), (iii) the case where $(\mu_k)_k$ is unimodal (UTBP) and (iv)
the case where $(\mu_k)_k$ is concave (CTBP). In the TBP problem the aim is to
output, at the end of the sequential game, the set of arms whose means are
above a given threshold. The regret is the highest gap between a misclassified
arm and the threshold. In the fixed budget setting, we provide problem
independent minimax rates for the expected regret in all settings, as well as
associated algorithms. We prove that the minimax rates for the regret are (i)
$\sqrt{\log(K)K/T}$ for TBP, (ii) $\sqrt{\log(K)/T}$ for MTBP, (iii)
$\sqrt{K/T}$ for UTBP and (iv) $\sqrt{\log\log K/T}$ for CTBP, where $K$ is the
number of arms and $T$ is the budget. These rates demonstrate that the
dependence on $K$ of the minimax regret varies significantly depending on the
shape constraint. This highlights the fact that the shape constraints
fundamentally modify the nature of the TBP.
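As a concrete illustration of the setting, below is a minimal simulation sketch of the vanilla, unstructured TBP under a naive uniform-allocation baseline. The unit-variance Gaussian reward model, the round-robin sampler, and all names (`uniform_tbp`, `tbp_regret`, `tau`) are illustrative assumptions rather than constructs from the paper; the paper's algorithms for the structured settings are more involved.

```python
# Minimal sketch of the vanilla Thresholding Bandit problem (TBP).
# Assumptions (not from the paper): unit-variance Gaussian rewards and a
# naive round-robin allocation of the budget T over the K arms.
import numpy as np

def uniform_tbp(mu, tau, T, rng):
    """Spend the budget T round-robin over the K arms, then classify each
    arm as 'above threshold' iff its empirical mean exceeds tau."""
    K = len(mu)
    pulls = np.zeros(K, dtype=int)
    sums = np.zeros(K)
    for t in range(T):
        k = t % K                          # round-robin arm choice
        sums[k] += rng.normal(mu[k], 1.0)  # draw one Gaussian reward
        pulls[k] += 1
    est = sums / np.maximum(pulls, 1)      # empirical means
    return est > tau                       # predicted over-threshold set

def tbp_regret(mu, tau, predicted_above):
    """Regret: the largest gap |mu_k - tau| over misclassified arms
    (0 if every arm is classified correctly)."""
    truth = mu > tau
    mis = predicted_above != truth
    return np.abs(mu - tau)[mis].max() if mis.any() else 0.0

rng = np.random.default_rng(0)
mu = np.array([0.1, 0.4, 0.55, 0.8])  # hypothetical arm means
tau, T = 0.5, 2000                    # threshold and sampling budget
print(tbp_regret(mu, tau, uniform_tbp(mu, tau, T, rng)))
```

Averaged over many runs, the expected regret of such a uniform baseline scales roughly as $\sqrt{\log(K)K/T}$, the unstructured minimax rate quoted above; the structured settings are precisely where smarter, shape-aware allocation improves the dependence on $K$.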
Related papers
- Improved Bound for Robust Causal Bandits with Linear Models [16.60875994745622]
This paper investigates the robustness of causal bandits in the face of temporal model fluctuations.
The proposed algorithm achieves nearly optimal $\tilde{\mathcal{O}}(\sqrt{T})$ regret when $C$ is $o(\sqrt{T})$.
arXiv Detail & Related papers (2024-05-13T14:41:28Z)
- Federated Linear Bandits with Finite Adversarial Actions [20.1041278044797]
We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem.
To address the unique challenges of adversarial finite action sets, we propose the FedSupLinUCB algorithm.
We prove that FedSupLinUCB achieves a total regret of $\tilde{O}(\sqrt{dT})$, where $T$ is the total number of arm pulls from all clients, and $d$ is the ambient dimension of the linear model.
arXiv Detail & Related papers (2023-11-02T03:41:58Z)
- Robust Causal Bandits for Linear Models [20.028245872662843]
Sequential design of experiments for optimizing a reward function in causal systems can be effectively modeled by the sequential design of interventions in causal bandits (CBs).
This paper addresses the robustness of CBs to such model fluctuations.
Cumulative regret is adopted as the design criterion, based on which the objective is to design a sequence of interventions that incurs the smallest cumulative regret with respect to an oracle aware of the entire causal model and its fluctuations.
arXiv Detail & Related papers (2023-10-30T17:58:01Z)
- Batch-Size Independent Regret Bounds for Combinatorial Semi-Bandits with Probabilistically Triggered Arms or Independent Arms [53.89752069984382]
We study combinatorial multi-armed bandits with semi-bandit feedback (CMAB) and focus on reducing the dependency of the regret bound on the batch size $K$.
First, for the setting of CMAB with probabilistically triggered arms (CMAB-T), we propose a BCUCB-T algorithm with variance-aware confidence intervals.
Second, for the setting of non-triggering CMAB with independent arms, we propose a SESCB algorithm which leverages the non-triggering version of the TPVM condition.
arXiv Detail & Related papers (2022-08-31T13:09:39Z)
- Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite-horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
arXiv Detail & Related papers (2022-04-11T15:08:09Z)
- Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies [88.75843804630772]
We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret, in contrast to existing bounds.
Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies.
arXiv Detail & Related papers (2022-03-24T08:14:12Z)
- Minimal Expected Regret in Linear Quadratic Control [79.81807680370677]
We devise an online learning algorithm and provide guarantees on its expected regret.
This regret at time $T$ is upper bounded by $\widetilde{O}((d_u+d_x)\sqrt{d_x T})$ when $A$ and $B$ are unknown.
arXiv Detail & Related papers (2021-09-29T14:07:21Z)
- Problem Dependent View on Structured Thresholding Bandit Problems [73.70176003598449]
We investigate the problem dependent regime in the Thresholding Bandit problem (TBP).
The objective of the learner is to output, at the end of a sequential game, the set of arms whose means are above a given threshold.
We provide upper and lower bounds for the probability of error in both the concave and monotone settings.
arXiv Detail & Related papers (2021-06-18T15:01:01Z)
- Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_\star \sqrt{SAK})$, where $K$ is the number of episodes, $S$ is the number of states, and $A$ is the number of actions.
arXiv Detail & Related papers (2021-04-22T17:20:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.