Improved Algorithms for Conservative Exploration in Bandits
- URL: http://arxiv.org/abs/2002.03221v1
- Date: Sat, 8 Feb 2020 19:35:01 GMT
- Title: Improved Algorithms for Conservative Exploration in Bandits
- Authors: Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo
Pirotta
- Abstract summary: We study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2).
We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems.
- Score: 113.55554483194832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many fields such as digital marketing, healthcare, finance, and robotics,
it is common to have a well-tested and reliable baseline policy running in
production (e.g., a recommender system). Nonetheless, the baseline policy is
often suboptimal. In this case, it is desirable to deploy online learning
algorithms (e.g., a multi-armed bandit algorithm) that interact with the system
to learn a better/optimal policy under the constraint that during the learning
process the performance is almost never worse than the performance of the
baseline itself. In this paper, we study the conservative learning problem in
the contextual linear bandit setting and introduce a novel algorithm, the
Conservative Constrained LinUCB (CLUCB2). We derive regret bounds for CLUCB2
that match existing results and empirically show that it outperforms
state-of-the-art conservative bandit algorithms in a number of synthetic and
real-world problems. Finally, we consider a more realistic constraint where the
performance is verified only at predefined checkpoints (instead of at every
step) and show how this relaxed constraint favorably impacts the regret and
empirical performance of CLUCB2.
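The conservative mechanism described above can be made concrete with a short sketch: at each round the learner computes the optimistic (LinUCB) action, but plays it only if a pessimistic estimate of the cumulative reward still keeps performance within a factor (1 - alpha) of the baseline's cumulative reward; otherwise it falls back to the baseline action. This is a minimal illustration assuming a known per-round baseline reward and a fixed confidence width; the class name, parameters, and the exact constraint check are simplifying assumptions, not the CLUCB2 construction from the paper.

```python
import numpy as np

class ConservativeLinUCB:
    """Minimal sketch of a conservative contextual linear bandit learner.

    The learner keeps a ridge-regression estimate of the reward parameter.
    Each round it finds the optimistic (largest-UCB) arm, then plays it only
    if, even under a pessimistic (lower-confidence) estimate, the cumulative
    reward stays above (1 - alpha) times the baseline's cumulative reward;
    otherwise it plays the baseline arm.  The per-round baseline reward is
    assumed known and `beta` is a fixed confidence width -- both are
    simplifying assumptions, not the CLUCB2 construction from the paper.
    """

    def __init__(self, dim, alpha=0.05, reg=1.0, beta=1.0):
        self.alpha = alpha           # allowed fraction of baseline performance to lose
        self.beta = beta             # confidence-width multiplier (assumed constant)
        self.A = reg * np.eye(dim)   # regularized design matrix
        self.b = np.zeros(dim)       # sum of reward-weighted features
        self.margin = 0.0            # cumulative reward minus (1 - alpha) * baseline so far

    def _bounds(self, x):
        theta = np.linalg.solve(self.A, self.b)
        width = self.beta * np.sqrt(x @ np.linalg.solve(self.A, x))
        return x @ theta - width, x @ theta + width  # (lower, upper) confidence bounds

    def select(self, arm_features, baseline_arm, baseline_reward):
        # Optimistic candidate: the arm with the largest upper confidence bound.
        candidate = max(range(len(arm_features)),
                        key=lambda a: self._bounds(arm_features[a])[1])
        lcb = self._bounds(arm_features[candidate])[0]
        # Conservative check: play the candidate only if the constraint
        #   cumulative reward >= (1 - alpha) * cumulative baseline reward
        # is guaranteed to hold even in the pessimistic case.
        if self.margin + lcb - (1.0 - self.alpha) * baseline_reward >= 0.0:
            return candidate
        return baseline_arm

    def update(self, x, reward, baseline_reward):
        # Standard ridge-regression update plus bookkeeping of the margin.
        self.A += np.outer(x, x)
        self.b += reward * x
        self.margin += reward - (1.0 - self.alpha) * baseline_reward
```

A caller would pass the feature vectors of the available arms together with the baseline arm's index and its (known or lower-bounded) reward, then call `update` with the feature vector and observed reward of whichever arm was actually played. Under the checkpoint relaxation mentioned in the abstract, the conservative check in `select` would be enforced only at predefined rounds instead of at every step, which leaves more room for exploration between checkpoints.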
Related papers
- Indexed Minimum Empirical Divergence-Based Algorithms for Linear Bandits [55.938644481736446]
Indexed Minimum Empirical Divergence (IMED) is a highly effective approach to the multi-armed bandit problem.
It has been observed to empirically outperform UCB-based algorithms and Thompson Sampling.
We present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms.
arXiv Detail & Related papers (2024-05-24T04:11:58Z) - Grid-Mapping Pseudo-Count Constraint for Offline Reinforcement Learning [2.01030009289749]
A pseudo-count method for continuous domains, called the Grid-Mapping Pseudo-Count method (GPC), is proposed.
GPC is combined with the Soft Actor-Critic (SAC) algorithm to obtain a new algorithm called GPC-SAC.
Experiments on D4RL datasets show that GPC-SAC achieves better performance at a lower computational cost than other algorithms.
arXiv Detail & Related papers (2024-04-03T08:03:27Z) - Learning Adversarial MDPs with Stochastic Hard Constraints [37.24692425018]
We study online learning problems in constrained Markov decision processes (MDPs) with adversarial losses and hard constraints.
We design an algorithm that achieves sublinear regret while ensuring that the constraints are satisfied at every episode with high probability.
arXiv Detail & Related papers (2024-03-06T12:49:08Z) - Conservative Exploration for Policy Optimization via Off-Policy Policy
Evaluation [4.837737516460689]
We study the problem of conservative exploration, where the learner must be able to guarantee that its performance is at least as good as that of a baseline policy.
We propose the first conservative provably efficient model-free algorithm for policy optimization in continuous finite-horizon problems.
arXiv Detail & Related papers (2023-12-24T10:59:32Z) - Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z) - Contextual Bandits with Packing and Covering Constraints: A Modular Lagrangian Approach via Regression [65.8785736964253]
We consider contextual bandits with linear constraints (CBwLC), a variant of contextual bandits in which the algorithm consumes multiple resources subject to linear constraints on total consumption.
This problem generalizes contextual bandits with knapsacks (CBwK), allowing for packing and covering constraints, as well as positive and negative resource consumption.
We provide the first algorithm for CBwLC (or CBwK) that is based on regression oracles. The algorithm is simple, computationally efficient, and statistically optimal under mild assumptions.
arXiv Detail & Related papers (2022-11-14T16:08:44Z) - Learning-Augmented Algorithms for Online Linear and Semidefinite
Programming [9.849604820019394]
Semidefinite programming (SDP) is a unifying framework that generalizes both linear and quadratically-constrained programming.
Known impossibility results exist for approximating the optimal solution when the constraints of covering SDPs arrive in an online fashion.
We show that if the predictor is accurate, we can efficiently bypass these impossibility results and achieve a constant-factor approximation to the optimal solution.
arXiv Detail & Related papers (2022-09-21T19:16:29Z) - Safe Online Bid Optimization with Return-On-Investment and Budget
Constraints subject to Uncertainty [87.81197574939355]
We study the nature of both the optimization and learning problems.
We provide an algorithm, namely GCB, guaranteeing sublinear regret at the cost of a potentially linear number of constraint violations.
More interestingly, we provide an algorithm, namely GCB_safe(psi,phi), guaranteeing both sublinear pseudo-regret and safety w.h.p. at the cost of accepting tolerances psi and phi.
arXiv Detail & Related papers (2022-01-18T17:24:20Z) - Efficient and Optimal Algorithms for Contextual Dueling Bandits under
Realizability [59.81339109121384]
We study the $K$-armed contextual dueling bandit problem, a sequential decision making setting in which the learner uses contextual information to make two decisions, but only observes preference-based feedback suggesting that one decision was better than the other.
We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.
arXiv Detail & Related papers (2021-11-24T07:14:57Z) - Conservative Optimistic Policy Optimization via Multiple Importance
Sampling [0.0]
Reinforcement Learning has been able to solve hard problems such as playing Atari games or solving the game of Go.
However, modern deep RL approaches are still not widely used in real-world applications.
arXiv Detail & Related papers (2021-03-04T20:23:38Z)