Stage-wise Conservative Linear Bandits
- URL: http://arxiv.org/abs/2010.00081v1
- Date: Wed, 30 Sep 2020 19:51:37 GMT
- Title: Stage-wise Conservative Linear Bandits
- Authors: Ahmadreza Moradipari, Christos Thrampoulidis, Mahnoosh Alizadeh
- Abstract summary: We study bandit optimization, which accounts for (unknown) safety constraints that appear in applications such as online advertising and medical trials.
We present two novel algorithms that respect the baseline constraints and enjoy probabilistic regret bounds of order O(\sqrt{T} \log^{3/2}T) and O(\sqrt{T} \log T), respectively.
Notably, the proposed algorithms can be adjusted with only minor modifications to tackle different problem variations.
- Score: 37.717532659194426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study stage-wise conservative linear stochastic bandits: an instance of
bandit optimization, which accounts for (unknown) safety constraints that
appear in applications such as online advertising and medical trials. At each
stage, the learner must choose actions that not only maximize cumulative reward
across the entire time horizon but further satisfy a linear baseline constraint
that takes the form of a lower bound on the instantaneous reward. For this
problem, we present two novel algorithms, stage-wise conservative linear
Thompson Sampling (SCLTS) and stage-wise conservative linear UCB (SCLUCB), that
respect the baseline constraints and enjoy probabilistic regret bounds of order
O(\sqrt{T} \log^{3/2}T) and O(\sqrt{T} \log T), respectively. Notably, the
proposed algorithms can be adjusted with only minor modifications to tackle
different problem variations, such as constraints with bandit-feedback, or an
unknown sequence of baseline actions. We discuss these and other improvements
over the state-of-the-art. For instance, compared to existing solutions, we
show that SCLTS plays the (non-optimal) baseline action at most O(\log{T})
times (compared to O(\sqrt{T})). Finally, we make connections to another
studied form of safety constraints that takes the form of an upper bound on the
instantaneous reward. While this incurs additional complexity to the learning
process as the optimal action is not guaranteed to belong to the safe set at
each round, we show that SCLUCB can properly adjust in this setting via a
simple modification.
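The stage-wise constraint has to hold in every single round, so the selection rule only explores when even a pessimistic estimate of the candidate action's reward clears the baseline threshold. Below is a minimal sketch of that logic, assuming the common conservative formulation in which the played action's expected reward must be at least (1 - alpha) times the baseline action's reward and the baseline reward is known to the learner; the names (sclucb_sketch, alpha, beta) are illustrative, and this is not the paper's exact SCLUCB, whose confidence radius and baseline handling differ.

```python
import numpy as np

def sclucb_sketch(arms, baseline_arm, theta_true, T=2000, alpha=0.1,
                  lam=1.0, beta=2.0, noise_sd=0.1, seed=0):
    """Illustrative stage-wise conservative LinUCB-style loop (not the paper's exact SCLUCB)."""
    rng = np.random.default_rng(seed)
    d = arms.shape[1]
    V = lam * np.eye(d)          # regularized design matrix
    b = np.zeros(d)              # running sum of x_t * r_t
    baseline_reward = baseline_arm @ theta_true  # assumption: baseline reward known to the learner
    total_reward = 0.0
    for t in range(T):
        theta_hat = np.linalg.solve(V, b)
        V_inv = np.linalg.inv(V)
        # Confidence widths beta * ||x||_{V^{-1}} for every arm.
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, V_inv, arms))
        ucb = arms @ theta_hat + beta * widths
        lcb = arms @ theta_hat - beta * widths
        i_opt = int(np.argmax(ucb))  # optimistic candidate
        # Stage-wise conservative check: play the optimistic arm only if even its
        # pessimistic reward estimate clears the baseline threshold this round.
        if lcb[i_opt] >= (1.0 - alpha) * baseline_reward:
            x = arms[i_opt]
        else:
            x = baseline_arm
        r = x @ theta_true + noise_sd * rng.standard_normal()
        V += np.outer(x, x)
        b += r * x
        total_reward += r
    return total_reward

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, K = 5, 20
    arms = rng.standard_normal((K, d))
    theta = rng.standard_normal(d)
    baseline = arms[int(np.argsort(arms @ theta)[K // 2])]  # a mediocre but safe arm
    print(sclucb_sketch(arms, baseline, theta))
```

In this sketch, falling back to the baseline action keeps every round feasible by construction; the paper's analysis bounds how often that fallback occurs.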
Related papers
- Learning to Explore with Lagrangians for Bandits under Unknown Linear Constraints [8.784438985280094]
We study the problem of pure exploration in multi-armed bandits with unknown linear constraints.
First, we propose a Lagrangian relaxation of the sample complexity lower bound for pure exploration under constraints.
Second, we leverage the Lagrangian lower bound and the properties of convexity to propose two computationally efficient extensions of Track-and-Stop and Gamified Explorer, namely LATS and LAGEX.
arXiv Detail & Related papers (2024-10-24T15:26:14Z) - Beyond Primal-Dual Methods in Bandits with Stochastic and Adversarial Constraints [29.514323697659613]
We address a generalization of the bandit with knapsacks problem, where a learner aims to maximize rewards while satisfying an arbitrary set of long-term constraints.
Our goal is to design best-of-both-worlds algorithms that perform well under both stochastic and adversarial constraints.
arXiv Detail & Related papers (2024-05-25T08:09:36Z) - Batched Nonparametric Contextual Bandits [21.031965676746776]
We study nonparametric contextual bandits under batch constraints.
We propose a novel batch learning algorithm that achieves the optimal regret.
Our theoretical results suggest that for nonparametric contextual bandits, a nearly constant number of policy updates can attain optimal regret.
arXiv Detail & Related papers (2024-02-27T18:06:20Z) - Interactively Learning Preference Constraints in Linear Bandits [100.78514640066565]
We study sequential decision-making with known rewards and unknown constraints.
As an application, we consider learning constraints to represent human preferences in a driving simulation.
arXiv Detail & Related papers (2022-06-10T17:52:58Z) - On Kernelized Multi-Armed Bandits with Constraints [16.102401271318012]
We study a bandit problem with a general unknown reward function and a general unknown constraint function.
We propose a general framework for both algorithm design and performance analysis.
We demonstrate the superior performance of our proposed algorithms via numerical experiments.
arXiv Detail & Related papers (2022-03-29T14:02:03Z) - Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection [71.9765117768556]
We consider the contextual bandit problem under the high dimensional linear model.
This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine.
We propose doubly growing epochs and estimating the parameter using the best subset selection method.
arXiv Detail & Related papers (2020-09-04T04:10:39Z) - On Lower Bounds for Standard and Robust Gaussian Process Bandit Optimization [55.937424268654645]
We consider algorithm-independent lower bounds for the problem of black-box optimization of functions having a bounded norm.
We provide a novel proof technique for deriving lower bounds on the regret, with benefits including simplicity, versatility, and an improved dependence on the error probability.
arXiv Detail & Related papers (2020-08-20T03:48:14Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Thompson Sampling for Linearly Constrained Bandits [18.477962150017095]
We describe LinConTS, an algorithm for bandits that place a linear constraint on the probability of earning a reward in every round.
We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by O(log T) for the suboptimal arms.
arXiv Detail & Related papers (2020-04-20T13:06:35Z) - Improved Algorithms for Conservative Exploration in Bandits [113.55554483194832]
We study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2); see the sketch after this list for the cumulative budget constraint that this line of work enforces.
We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems.
arXiv Detail & Related papers (2020-02-08T19:35:01Z)
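For contrast with the stage-wise constraint above, the conservative-exploration line of work (e.g., the CLUCB family) typically enforces a cumulative budget: the total reward accrued so far must not fall below a (1 - alpha) fraction of what the baseline policy would have earned. The helper below is a hypothetical illustration of that budget check, with names of our choosing, not CLUCB2's exact rule.

```python
def cumulative_budget_ok(lcb_reward_so_far: float, lcb_candidate: float,
                         baseline_reward: float, t: int, alpha: float = 0.1) -> bool:
    """Budget-style safety check used in cumulative conservative bandits.

    Explore only if, even under worst-case (lower-confidence) estimates, the
    total reward through round t stays above (1 - alpha) times the reward the
    baseline policy would have collected over the same t rounds.
    """
    return lcb_reward_so_far + lcb_candidate >= (1.0 - alpha) * baseline_reward * t
```

Under this cumulative view, an unsafe-looking round can be paid for by earlier surplus, whereas the stage-wise constraint studied in the main paper offers no such slack, which is why its algorithms fall back to the baseline action whenever the per-round check fails.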