A Unified Framework for Conservative Exploration
- URL: http://arxiv.org/abs/2106.11692v1
- Date: Tue, 22 Jun 2021 11:52:04 GMT
- Title: A Unified Framework for Conservative Exploration
- Authors: Yunchang Yang, Tianhao Wu, Han Zhong, Evrard Garcelon, Matteo Pirotta,
Alessandro Lazaric, Liwei Wang, Simon S. Du
- Abstract summary: We study bandits and reinforcement learning (RL) subject to a conservative constraint where the agent is asked to perform at least as well as a given baseline policy.
We present a unified framework for conservative bandits and RL, in which our core technique is to calculate the necessary and sufficient budget obtained from running the baseline policy.
- Score: 115.7063101600773
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We study bandits and reinforcement learning (RL) subject to a conservative
constraint where the agent is asked to perform at least as well as a given
baseline policy. This setting is particularly relevant in real-world domains
including digital marketing, healthcare, production, finance, etc. For
multi-armed bandits, linear bandits and tabular RL, specialized algorithms and
theoretical analyses were proposed in previous work. In this paper, we present
a unified framework for conservative bandits and RL, in which our core
technique is to calculate the necessary and sufficient budget obtained from
running the baseline policy. For lower bounds, our framework gives a black-box
reduction that turns a certain lower bound in the nonconservative setting into
a new lower bound in the conservative setting. We strengthen the existing lower
bound for conservative multi-armed bandits and obtain new lower bounds for
conservative linear bandits, tabular RL and low-rank MDP. For upper bounds, our
framework turns a certain nonconservative upper-confidence-bound (UCB)
algorithm into a conservative algorithm with a simple analysis. For multi-armed
bandits, linear bandits and tabular RL, our new upper bounds tighten or match
existing ones with significantly simpler analyses. We also obtain a new upper
bound for conservative low-rank MDP.
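To make the conservative constraint concrete, the sketch below implements the generic budget-checked UCB loop for multi-armed bandits: the UCB candidate is played only when a lower-confidence (pessimistic) estimate of the cumulative reward still covers a (1 - alpha) fraction of the baseline's cumulative performance; otherwise the baseline arm is pulled to build budget. This is an illustrative sketch of the standard conservative-exploration recipe, not the paper's unified algorithm; the function name `conservative_ucb` and its parameters (`arm_means`, `alpha`, `horizon`, `seed`) are placeholders chosen for the example.

```python
import numpy as np


def conservative_ucb(arm_means, alpha, horizon, seed=0):
    """Illustrative conservative UCB loop for a K-armed Bernoulli bandit.

    Arm 0 is the baseline arm with known mean arm_means[0]. The UCB candidate
    is played only if a pessimistic (lower-confidence) estimate of the total
    reward after that pull still satisfies the conservative constraint
        sum_{s <= t} r_s >= (1 - alpha) * t * mu_0;
    otherwise the baseline arm is pulled to accumulate budget.
    """
    rng = np.random.default_rng(seed)
    arm_means = np.asarray(arm_means, dtype=float)
    mu0 = arm_means[0]
    K = len(arm_means)
    counts = np.zeros(K)  # pulls per arm
    sums = np.zeros(K)    # cumulative reward per arm

    for t in range(1, horizon + 1):
        means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
        bonus = np.sqrt(2.0 * np.log(t + 1) / np.maximum(counts, 1.0))
        ucb = np.where(counts > 0, means + bonus, np.inf)
        lcb = np.where(counts > 0, np.maximum(means - bonus, 0.0), 0.0)
        # The baseline mean is assumed known, so its interval collapses to a point.
        ucb[0] = mu0
        lcb[0] = mu0

        candidate = int(np.argmax(ucb))

        # Pessimistic estimate of the total reward if the candidate is pulled now.
        pessimistic = counts @ lcb + lcb[candidate]

        if pessimistic >= (1.0 - alpha) * t * mu0:
            arm = candidate   # budget suffices: follow the UCB recommendation
        else:
            arm = 0           # budget too small: play the baseline arm

        reward = float(rng.random() < arm_means[arm])  # Bernoulli reward draw
        counts[arm] += 1
        sums[arm] += reward

    return sums.sum()
```

For instance, `conservative_ucb([0.5, 0.4, 0.7], alpha=0.1, horizon=10_000)` runs the loop with arm 0 as the baseline (known mean 0.5); the budget check ensures that, under the pessimistic estimate, at least 90% of the baseline's cumulative reward is retained at every round.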
Related papers
- Contextual bandits with concave rewards, and an application to fair
ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z)
- Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning [69.39357308375212]
Off-policy multi-step reinforcement learning algorithms consist of conservative and non-conservative algorithms.
Recent studies have shown that non-conservative algorithms outperform conservative ones.
arXiv Detail & Related papers (2021-02-27T02:29:01Z)
- A One-Size-Fits-All Solution to Conservative Bandit Problems [32.907883501286]
We study a family of conservative bandit problems (CBPs) with sample-path reward constraints.
We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems.
arXiv Detail & Related papers (2020-12-14T08:50:23Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- Meta-Learning Bandit Policies by Gradient Ascent [38.817374110000735]
Bandit policies are typically designed either to minimize regret in any problem instance, or, in a Bayesian sense, to perform well under a prior distribution over environment parameters.
We study bandit problems that fall between these two extremes, where the learning agent has access to sampled bandit instances.
We propose the use of parameterized bandit policies that are differentiable and can be optimized using policy gradients.
arXiv Detail & Related papers (2020-06-09T07:45:41Z)
- Locally Differentially Private (Contextual) Bandits Learning [55.63825598391525]
We study locally differentially private (LDP) bandits learning in this paper.
We propose simple black-box reduction frameworks that can solve a large family of context-free bandits learning problems with LDP guarantee.
arXiv Detail & Related papers (2020-06-01T04:02:00Z)
- Improved Algorithms for Conservative Exploration in Bandits [113.55554483194832]
We study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2).
We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems.
arXiv Detail & Related papers (2020-02-08T19:35:01Z)