Stochastic Conservative Contextual Linear Bandits
- URL: http://arxiv.org/abs/2203.15629v1
- Date: Tue, 29 Mar 2022 14:50:50 GMT
- Title: Stochastic Conservative Contextual Linear Bandits
- Authors: Jiabin Lin, Xian Yeow Lee, Talukder Jubery, Shana Moothedath, Soumik
Sarkar, and Baskar Ganapathysubramanian
- Abstract summary: We study the problem of safe real-time decision making under uncertainty.
We propose a conservative stochastic contextual bandit formulation for real-time decision making.
- Score: 8.684768561839146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many physical systems have underlying safety considerations that require that
the strategy deployed ensures the satisfaction of a set of constraints.
Further, often we have only partial information on the state of the system. We
study the problem of safe real-time decision making under uncertainty. In this
paper, we propose a conservative stochastic contextual bandit formulation for
real-time decision making when an adversary chooses a distribution on the set
of possible contexts and the learner is subject to certain safety/performance
constraints. The learner observes only the context distribution, while the exact
context remains unknown; the goal is to develop an algorithm that selects a
sequence of optimal actions to maximize the cumulative reward without violating
the safety constraints at any time step. By leveraging the UCB algorithm for
this setting, we propose a conservative linear UCB algorithm for stochastic
bandits with context distribution. We prove an upper bound on the regret of the
algorithm and show that it can be decomposed into three terms: (i) an upper
bound for the regret of the standard linear UCB algorithm, (ii) a constant term
(independent of time horizon) that accounts for the loss of being conservative
in order to satisfy the safety constraint, and (iii) a constant term
(independent of time horizon) that accounts for the loss due to the contexts being
unknown and only the distribution being known. To validate the performance of
our approach, we perform extensive simulations on synthetic data and on
real-world maize data collected through the Genomes to Fields (G2F) initiative.
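The abstract gives no pseudocode; the snippet below is a minimal, illustrative sketch of the conservative linear UCB idea it describes, assuming linear expected rewards, a known safe baseline arm with known expected reward, and a budget check that keeps cumulative reward within a factor (1 - alpha) of the baseline's. All names and the simplified budget check are assumptions, not the paper's exact algorithm.
```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, alpha = 5, 10, 2000, 0.1        # feature dim, arms, horizon, allowed loss
theta_true = rng.normal(size=d) / np.sqrt(d)
base_feat = rng.normal(size=(K, d))      # per-arm mean feature vectors
b0 = 0                                   # known safe baseline arm
r_base = float(base_feat[b0] @ theta_true)  # baseline's (assumed known) reward

A, b, beta = np.eye(d), np.zeros(d), 1.0 # ridge Gram matrix, moments, radius
cum_lcb, n_base = 0.0, 0                 # certified reward of UCB plays, baseline count

for t in range(1, T + 1):
    # Only the context distribution is observed, so each arm is scored by
    # its EXPECTED feature vector under that distribution.
    phi = base_feat
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    width = beta * np.sqrt(np.einsum('ki,ij,kj->k', phi, A_inv, phi))
    ucb, lcb = phi @ theta_hat + width, phi @ theta_hat - width
    a_opt = int(np.argmax(ucb))

    # Conservative check: commit to the optimistic arm only if a pessimistic
    # estimate of cumulative reward still clears (1 - alpha) of the baseline.
    if cum_lcb + lcb[a_opt] + n_base * r_base >= (1 - alpha) * t * r_base:
        a = a_opt
        cum_lcb += lcb[a]
    else:
        a = b0
        n_base += 1

    x = phi[a] + 0.1 * rng.normal(size=d)    # realised but unobserved context
    r = float(x @ theta_true) + 0.05 * rng.normal()
    A += np.outer(phi[a], phi[a])
    b += phi[a] * r
```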
Related papers
- Safe Reinforcement Learning for Constrained Markov Decision Processes with Stochastic Stopping Time [0.6554326244334868]
We present an online reinforcement learning algorithm for constrained Markov decision processes with a safety constraint.
We show that the learned policy is safe with high confidence.
We also demonstrate that efficient exploration can be achieved by defining a subset of the state space called the proxy set.
arXiv Detail & Related papers (2024-03-23T20:22:30Z) - Distributed Multi-Task Learning for Stochastic Bandits with Context Distribution and Stage-wise Constraints [0.0]
We propose a distributed upper confidence bound (UCB) algorithm, related-UCB.
Our algorithm constructs a pruned action set during each round to ensure the constraints are met.
We empirically validate the performance of our algorithm on synthetic data and real-world MovieLens-100K data.
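The entry above only names the pruning step; here is a minimal sketch of what a pruned action set can look like in one linear-bandit round, assuming a stage-wise constraint of the form "expected reward of the played arm >= tau". The function name, scalar threshold tau, and fallback rule are illustrative assumptions, not the paper's algorithm.
```python
import numpy as np

def pruned_ucb_round(phi, theta_hat, A_inv, beta, tau):
    """One round of a UCB rule over a pruned action set (illustrative).

    phi: (K, d) expected feature vectors; theta_hat: ridge estimate;
    A_inv: inverse Gram matrix; beta: confidence radius; tau: stage-wise
    threshold the played action must provably satisfy.
    """
    width = beta * np.sqrt(np.einsum('ki,ij,kj->k', phi, A_inv, phi))
    ucb, lcb = phi @ theta_hat + width, phi @ theta_hat - width
    safe = np.flatnonzero(lcb >= tau)       # prune arms not provably above tau
    if safe.size == 0:                      # fallback: least risky arm
        return int(np.argmax(lcb))
    return int(safe[np.argmax(ucb[safe])])  # optimism within the pruned set
```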
arXiv Detail & Related papers (2024-01-21T18:43:55Z) - Contextual bandits with concave rewards, and an application to fair
ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
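The reduction is described only at a high level; the snippet below sketches the Frank-Wolfe-style scalarization that such CBCR-type reductions build on: vector rewards are weighted by the gradient of the concave objective at the running average, yielding a scalar-reward problem any no-regret bandit can handle. The toy objective f here is an assumption, not the paper's.
```python
import numpy as np

def cbcr_scalarize(reward_vectors, grad_f, z):
    """Turn vector rewards into scalar ones by linearising the concave
    objective f at the running average z (the core of a Frank-Wolfe-style
    reduction to a scalar-reward bandit)."""
    w = grad_f(z)                 # gradient of f at the current average
    return reward_vectors @ w     # scalar proxy reward per action

# Toy usage: f(z) = sum(log(1 + z)) rewards balance across components.
grad_f = lambda z: 1.0 / (1.0 + z)
z = np.array([0.8, 0.1])          # running average of 2-dim rewards
scores = cbcr_scalarize(np.array([[1.0, 0.0], [0.0, 1.0]]), grad_f, z)
# np.argmax(scores) == 1: the under-served second component is favoured.
```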
arXiv Detail & Related papers (2022-10-18T16:11:55Z) - A Unifying Framework for Online Optimization with Long-Term Constraints [62.35194099438855]
We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints.
The goal is to maximize their total reward, while at the same time achieving small cumulative violation across the $T$ rounds.
We present the first best-of-both-worlds-type algorithm for this general class of problems, with no-regret guarantees both in the case in which rewards and constraints are selected according to an unknown stochastic model, and in the case in which they are selected at each round by an adversary.
arXiv Detail & Related papers (2022-09-15T16:59:19Z) - Log Barriers for Safe Black-box Optimization with Application to Safe
Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing violation in policy tasks in safe reinforcement learning.
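As a rough illustration of the log-barrier idea attributed to LBSGD above, the sketch below descends f(x) - (1/t) log(-g(x)) for a problem min f(x) s.t. g(x) <= 0, with a step size that shrinks near the constraint boundary. The step-size rule here is a simplified stand-in for the paper's carefully chosen one.
```python
import numpy as np

def lbsgd_step(x, grad_f, g, grad_g, eta_base=0.1, t_barrier=1.0):
    """One gradient step on the log-barrier surrogate
    B(x) = f(x) - (1/t_barrier) * log(-g(x)) (simplified step-size rule)."""
    gx = g(x)
    assert gx < 0, "iterate must stay strictly feasible"
    grad_B = grad_f(x) + (1.0 / t_barrier) * grad_g(x) / (-gx)
    # Shrink the step near the boundary so the next iterate stays feasible.
    eta = min(eta_base, 0.5 * (-gx) / (np.linalg.norm(grad_g(x)) + 1e-12))
    return x - eta * grad_B

# Toy usage: minimise x^2 subject to x >= 0.5, i.e. g(x) = 0.5 - x <= 0.
x = 2.0
for _ in range(300):
    x = lbsgd_step(x, lambda v: 2 * v, lambda v: 0.5 - v, lambda v: -1.0)
# x converges to the barrier optimum (~1.0 here); growing t_barrier over
# time drives it toward the constrained optimum 0.5 while staying feasible.
```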
arXiv Detail & Related papers (2022-07-21T11:14:47Z) - Conservative Distributional Reinforcement Learning with Safety
Constraints [22.49025480735792]
Safe exploration can be regarded as a constrained Markov decision problem where the expected long-term cost is constrained.
Previous off-policy algorithms convert the constrained optimization problem into the corresponding unconstrained dual problem by introducing the Lagrangian relaxation technique.
We present a novel off-policy reinforcement learning algorithm called Conservative Distributional Maximum a Posteriori Policy Optimization.
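The Lagrangian relaxation mentioned above is generic enough to sketch; the update below is the textbook dual-ascent form, not the paper's exact (distributional, MAP-style) procedure.
```python
def lagrangian_update(lmbda, cost_estimate, budget, lr=0.05):
    """Dual-ascent step of a generic Lagrangian relaxation: the constrained
    problem max_pi J_r(pi) s.t. J_c(pi) <= budget becomes the saddle problem
    max_pi min_{lmbda >= 0} J_r(pi) - lmbda * (J_c(pi) - budget).
    Raise lmbda when the constraint is violated; relax it when there is slack."""
    return max(0.0, lmbda + lr * (cost_estimate - budget))

def shaped_reward(r, c, lmbda):
    """Reward the unconstrained policy-improvement step actually optimises."""
    return r - lmbda * c
```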
arXiv Detail & Related papers (2022-01-18T19:45:43Z) - Safe Online Bid Optimization with Return-On-Investment and Budget
Constraints subject to Uncertainty [87.81197574939355]
We study the nature of both the optimization and learning problems.
We provide an algorithm, namely GCB, guaranteeing sublinear regret at the cost of a potentially linear number of constraints violations.
More interestingly, we provide an algorithm, namely GCB_safe(psi,phi), guaranteeing both sublinear pseudo-regret and safety w.h.p. at the cost of accepting tolerances psi and phi.
arXiv Detail & Related papers (2022-01-18T17:24:20Z) - Regret Bounds for Safe Gaussian Process Bandit Optimization [42.336882999112845]
In safety-critical systems, it is paramount that the learner's actions do not violate the safety constraints at any stage of the learning process.
We develop a safe variant of GP-UCB called SGP-UCB, with necessary modifications to respect safety constraints at every round.
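SGP-UCB is described above only as GP-UCB with modifications to respect safety; one minimal reading, sketched below with a small numpy RBF-kernel GP, is to certify actions via a pessimistic bound on a constraint GP and act optimistically with the reward GP inside the certified set. The kernel, the bounds, and the threshold h are assumptions, not the paper's exact construction.
```python
import numpy as np

def rbf(A, B, ls=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-2):
    """GP posterior mean/std at candidates Xs under an RBF kernel."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)                       # (n, m) cross-covariances
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y
    var = 1.0 - np.einsum('ij,ij->j', Ks, sol)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def sgp_ucb_pick(Xr, yr, Xc, yc, cand, beta=2.0, h=0.0):
    """Certify candidates with a pessimistic bound on the constraint GP,
    then act optimistically with the reward GP inside the certified set.
    Assumes safety means g(x) >= h."""
    mu_c, sd_c = gp_posterior(Xc, yc, cand)
    safe = np.flatnonzero(mu_c - beta * sd_c >= h)   # pessimistic safety test
    if safe.size == 0:
        return None                                   # nothing certifiable
    mu_r, sd_r = gp_posterior(Xr, yr, cand)
    return int(safe[np.argmax((mu_r + beta * sd_r)[safe])])
```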
arXiv Detail & Related papers (2020-05-05T03:54:43Z) - Improved Algorithms for Conservative Exploration in Bandits [113.55554483194832]
We study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2).
We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems.
arXiv Detail & Related papers (2020-02-08T19:35:01Z) - Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
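A minimal Thompson-style round for a mean-variance Gaussian bandit might look as follows; for simplicity it samples only each arm's mean from its posterior and plugs in an empirical variance, whereas the paper's samplers are more refined. rho is the risk-tolerance parameter.
```python
import numpy as np

rng = np.random.default_rng(1)

def mv_thompson_pick(counts, sums, sumsq, rho=1.0, sigma0=1.0):
    """Pick an arm for a mean-variance Gaussian bandit, Thompson-style:
    sample each arm's mean from its posterior, plug in an empirical
    variance estimate, and maximise (sampled mean - rho * variance)."""
    n = np.maximum(counts, 1)
    mu_hat = sums / n
    var_hat = np.maximum(sumsq / n - mu_hat ** 2, 1e-8)
    mu_sample = rng.normal(mu_hat, sigma0 / np.sqrt(n))   # posterior draw
    return int(np.argmax(mu_sample - rho * var_hat))
```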
arXiv Detail & Related papers (2020-02-01T15:33:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.