Stage-wise Conservative Linear Bandits
- URL: http://arxiv.org/abs/2010.00081v1
- Date: Wed, 30 Sep 2020 19:51:37 GMT
- Title: Stage-wise Conservative Linear Bandits
- Authors: Ahmadreza Moradipari, Christos Thrampoulidis, Mahnoosh Alizadeh
- Abstract summary: We study bandit optimization, which accounts for (unknown) safety constraints that appear in applications such as online advertising and medical trials.
We present two novel algorithms that respect the baseline constraints and enjoy probabilistic regret bounds of order O(\sqrt{T} \log^{3/2}T) and O(\sqrt{T} \log T), respectively.
Notably, the proposed algorithms can be adjusted with only minor modifications to tackle different problem variations.
- Score: 37.717532659194426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study stage-wise conservative linear stochastic bandits: an instance of
bandit optimization, which accounts for (unknown) safety constraints that
appear in applications such as online advertising and medical trials. At each
stage, the learner must choose actions that not only maximize cumulative reward
across the entire time horizon but further satisfy a linear baseline constraint
that takes the form of a lower bound on the instantaneous reward. For this
problem, we present two novel algorithms, stage-wise conservative linear
Thompson Sampling (SCLTS) and stage-wise conservative linear UCB (SCLUCB), that
respect the baseline constraints and enjoy probabilistic regret bounds of order
O(\sqrt{T} \log^{3/2}T) and O(\sqrt{T} \log T), respectively. Notably, the
proposed algorithms can be adjusted with only minor modifications to tackle
different problem variations, such as constraints with bandit-feedback, or an
unknown sequence of baseline actions. We discuss these and other improvements
over the state-of-the-art. For instance, compared to existing solutions, we
show that SCLTS plays the (non-optimal) baseline action at most O(\log{T})
times (compared to O(\sqrt{T})). Finally, we make connections to another
studied form of safety constraints that takes the form of an upper bound on the
instantaneous reward. While this incurs additional complexity to the learning
process as the optimal action is not guaranteed to belong to the safe set at
each round, we show that SCLUCB can properly adjust in this setting via a
simple modification.
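The stage-wise constraint has to hold in every single round, so the selection rule only explores when even a pessimistic estimate of the candidate action's reward clears the baseline threshold. Below is a minimal sketch of that logic, assuming the common conservative formulation in which the played action's expected reward must be at least (1 - alpha) times the baseline action's reward and the baseline reward is known to the learner; the names (sclucb_sketch, alpha, beta) are illustrative, and this is not the paper's exact SCLUCB, whose confidence radius and baseline handling differ.

```python
import numpy as np

def sclucb_sketch(arms, baseline_arm, theta_true, T=2000, alpha=0.1,
                  lam=1.0, beta=2.0, noise_sd=0.1, seed=0):
    """Illustrative stage-wise conservative LinUCB-style loop (not the paper's exact SCLUCB)."""
    rng = np.random.default_rng(seed)
    d = arms.shape[1]
    V = lam * np.eye(d)          # regularized design matrix
    b = np.zeros(d)              # running sum of x_t * r_t
    baseline_reward = baseline_arm @ theta_true  # assumption: baseline reward known to the learner
    total_reward = 0.0
    for t in range(T):
        theta_hat = np.linalg.solve(V, b)
        V_inv = np.linalg.inv(V)
        # Confidence widths beta * ||x||_{V^{-1}} for every arm.
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, V_inv, arms))
        ucb = arms @ theta_hat + beta * widths
        lcb = arms @ theta_hat - beta * widths
        i_opt = int(np.argmax(ucb))  # optimistic candidate
        # Stage-wise conservative check: play the optimistic arm only if even its
        # pessimistic reward estimate clears the baseline threshold this round.
        if lcb[i_opt] >= (1.0 - alpha) * baseline_reward:
            x = arms[i_opt]
        else:
            x = baseline_arm
        r = x @ theta_true + noise_sd * rng.standard_normal()
        V += np.outer(x, x)
        b += r * x
        total_reward += r
    return total_reward

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, K = 5, 20
    arms = rng.standard_normal((K, d))
    theta = rng.standard_normal(d)
    baseline = arms[int(np.argsort(arms @ theta)[K // 2])]  # a mediocre but safe arm
    print(sclucb_sketch(arms, baseline, theta))
```

In this sketch, falling back to the baseline action keeps every round feasible by construction; the paper's analysis bounds how often that fallback occurs.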
Related papers
- Learning to Explore with Lagrangians for Bandits under Unknown Linear Constraints [8.784438985280094]
We study the problem of pure exploration in multi-armed bandits with unknown linear constraints.
First, we propose a Lagrangian relaxation of the sample complexity lower bound for pure exploration under constraints.
Second, we leverage the Lagrangian lower bound and the properties of convexity to propose two computationally efficient extensions of Track-and-Stop and Gamified Explorer, namely LATS and LAGEX.
arXiv Detail & Related papers (2024-10-24T15:26:14Z) - Beyond Primal-Dual Methods in Bandits with Stochastic and Adversarial Constraints [29.514323697659613]
We address a generalization of the bandit with knapsacks problem, where a learner aims to maximize rewards while satisfying an arbitrary set of long-term constraints.
Our goal is to design best-of-both-worlds algorithms that perform well under both stochastic and adversarial constraints.
arXiv Detail & Related papers (2024-05-25T08:09:36Z) - Batched Nonparametric Contextual Bandits [21.031965676746776]
We study nonparametric contextual bandits under batch constraints.
We propose a novel batch learning algorithm that achieves the optimal regret.
Our theoretical results suggest that for nonparametric contextual bandits, a nearly constant number of policy updates can attain optimal regret.
arXiv Detail & Related papers (2024-02-27T18:06:20Z) - Interactively Learning Preference Constraints in Linear Bandits [100.78514640066565]
We study sequential decision-making with known rewards and unknown constraints.
As an application, we consider learning constraints to represent human preferences in a driving simulation.
arXiv Detail & Related papers (2022-06-10T17:52:58Z) - On Kernelized Multi-Armed Bandits with Constraints [16.102401271318012]
We study a bandit problem with a general unknown reward function and a general unknown constraint function.
We propose a general framework for both algorithm design and performance analysis.
We demonstrate the superior performance of our proposed algorithms via numerical experiments.
arXiv Detail & Related papers (2022-03-29T14:02:03Z) - Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection [71.9765117768556]
We consider the contextual bandit problem under the high dimensional linear model.
This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine.
We propose doubly growing epochs and estimating the parameter using the best subset selection method.
arXiv Detail & Related papers (2020-09-04T04:10:39Z) - On Lower Bounds for Standard and Robust Gaussian Process Bandit Optimization [55.937424268654645]
We consider algorithm-independent lower bounds for the problem of black-box optimization of functions having a bounded norm.
We provide a novel proof technique for deriving lower bounds on the regret, with benefits including simplicity, versatility, and an improved dependence on the error probability.
arXiv Detail & Related papers (2020-08-20T03:48:14Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Thompson Sampling for Linearly Constrained Bandits [18.477962150017095]
We describe LinConTS, an algorithm for bandits that place a linear constraint on the probability of earning a reward in every round.
We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by O(log T) for the suboptimal arms.
arXiv Detail & Related papers (2020-04-20T13:06:35Z) - Improved Algorithms for Conservative Exploration in Bandits [113.55554483194832]
We study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2); see the sketch after this list for the cumulative budget constraint that this line of work enforces.
We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems.
arXiv Detail & Related papers (2020-02-08T19:35:01Z)
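For contrast with the stage-wise constraint above, the conservative-exploration line of work (e.g., the CLUCB family) typically enforces a cumulative budget: the total reward accrued so far must not fall below a (1 - alpha) fraction of what the baseline policy would have earned. The helper below is a hypothetical illustration of that budget check, with names of our choosing, not CLUCB2's exact rule.

```python
def cumulative_budget_ok(lcb_reward_so_far: float, lcb_candidate: float,
                         baseline_reward: float, t: int, alpha: float = 0.1) -> bool:
    """Budget-style safety check used in cumulative conservative bandits.

    Explore only if, even under worst-case (lower-confidence) estimates, the
    total reward through round t stays above (1 - alpha) times the reward the
    baseline policy would have collected over the same t rounds.
    """
    return lcb_reward_so_far + lcb_candidate >= (1.0 - alpha) * baseline_reward * t
```

Under this cumulative view, an unsafe-looking round can be paid for by earlier surplus, whereas the stage-wise constraint studied in the main paper offers no such slack, which is why its algorithms fall back to the baseline action whenever the per-round check fails.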