Meta-Learning Bandit Policies by Gradient Ascent
- URL: http://arxiv.org/abs/2006.05094v2
- Date: Wed, 6 Jan 2021 03:44:39 GMT
- Title: Meta-Learning Bandit Policies by Gradient Ascent
- Authors: Branislav Kveton, Martin Mladenov, Chih-Wei Hsu, Manzil Zaheer, Csaba Szepesvari, and Craig Boutilier
- Abstract summary: Most bandit policies are designed either to minimize regret in any problem instance or to operate in a Bayesian sense, assuming a prior distribution over environment parameters.
We study bandit problems that fall between these two extremes, where the learning agent has access to sampled bandit instances.
We propose the use of parameterized bandit policies that are differentiable and can be optimized using policy gradients.
- Score: 38.817374110000735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most bandit policies are designed to either minimize regret in any problem
instance, making very few assumptions about the underlying environment, or in a
Bayesian sense, assuming a prior distribution over environment parameters. The
former are often too conservative in practical settings, while the latter
require assumptions that are hard to verify in practice. We study bandit
problems that fall between these two extremes, where the learning agent has
access to sampled bandit instances from an unknown prior distribution
$\mathcal{P}$ and aims to achieve high reward on average over the bandit
instances drawn from $\mathcal{P}$. This setting is of particular importance
because it lays foundations for meta-learning of bandit policies and reflects
more realistic assumptions in many practical domains. We propose the use of
parameterized bandit policies that are differentiable and can be optimized
using policy gradients. This provides a broadly applicable framework that is
easy to implement. We derive reward gradients that reflect the structure of
bandit problems and policies, for both non-contextual and contextual settings,
and propose a number of interesting policies that are both differentiable and
have low regret. Our algorithmic and theoretical contributions are supported by
extensive experiments that show the importance of baseline subtraction, learned
biases, and the practicality of our approach on a range of problems.
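Below is a minimal, hedged sketch of the idea summarized above: a parameterized, differentiable bandit policy tuned by gradient ascent on its average n-round reward over bandit instances sampled from an unknown prior, with baseline subtraction to reduce the variance of the score-function gradient estimate. The Bernoulli prior, the softmax-over-empirical-means policy with a single inverse-temperature parameter theta, and all hyperparameters are illustrative assumptions, not the paper's exact policies or reward-gradient derivations.

```python
# Minimal sketch (not the paper's exact policies or gradient derivations):
# meta-learning a softmax bandit policy by gradient ascent on its average
# n-round reward over bandit instances sampled from an unknown prior P.
# The Bernoulli prior, the softmax-over-empirical-means policy, and all
# hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
K, n = 5, 100                      # arms per instance, rounds per instance


def sample_instance():
    """Stand-in for the unknown prior P: Bernoulli means drawn uniformly."""
    return rng.uniform(0.25, 0.75, size=K)


def run_episode(theta, means):
    """Play one instance with a softmax policy over empirical means.

    theta is the scalar inverse temperature being meta-learned. Returns the
    total reward and the accumulated score sum_t d/dtheta log pi(a_t | h_t).
    """
    counts, successes = np.ones(K), np.ones(K)   # optimistic 1/1 init
    total_reward, score = 0.0, 0.0
    for _ in range(n):
        mu_hat = successes / counts
        logits = theta * mu_hat
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = rng.choice(K, p=probs)
        # d/dtheta log softmax(theta * mu_hat)[a] = mu_hat[a] - E_probs[mu_hat]
        score += mu_hat[a] - probs @ mu_hat
        r = float(rng.random() < means[a])
        successes[a] += r
        counts[a] += 1.0
        total_reward += r
    return total_reward, score


def meta_learn(iters=200, batch=16, lr=0.02):
    """REINFORCE-style gradient ascent with a mean-reward baseline."""
    theta = 1.0
    for _ in range(iters):
        episodes = [run_episode(theta, sample_instance()) for _ in range(batch)]
        rewards = np.array([e[0] for e in episodes])
        scores = np.array([e[1] for e in episodes])
        baseline = rewards.mean()              # baseline subtraction
        grad = np.mean((rewards - baseline) * scores)
        theta += lr * grad                     # ascend the expected reward
    return theta


if __name__ == "__main__":
    print("meta-learned inverse temperature:", meta_learn())
```

The generic estimator used here is $\nabla_\theta \mathbb{E}[R] = \mathbb{E}\big[(R - b)\sum_t \nabla_\theta \log \pi_\theta(a_t \mid h_t)\big]$: a constant baseline $b$ leaves the expectation unchanged but can substantially reduce its variance, which is the baseline-subtraction effect the abstract highlights as important in practice.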
Related papers
- Statistical Analysis of Policy Space Compression Problem [54.1754937830779]
Policy search methods are crucial in reinforcement learning, offering a framework to address continuous state-action and partially observable problems.
Reducing the policy space through policy compression emerges as a powerful, reward-free approach to accelerate the learning process.
This technique condenses the policy space into a smaller, representative set while maintaining most of the original effectiveness.
arXiv Detail & Related papers (2024-11-15T02:46:55Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned, but only their deterministic version is deployed.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Information Capacity Regret Bounds for Bandits with Mediator Feedback [55.269551124587224]
We introduce the policy set capacity as an information-theoretic measure for the complexity of the policy set.
Adopting the classical EXP4 algorithm, we provide new regret bounds depending on the policy set capacity.
For a selection of policy set families, we prove nearly-matching lower bounds, scaling similarly with the capacity.
arXiv Detail & Related papers (2024-02-15T19:18:47Z)
- Thompson Exploration with Best Challenger Rule in Best Arm Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z)
- Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z)
- Dual Instrumental Method for Confounded Kernelized Bandits [0.0]
The contextual bandit problem is a framework with wide applications in various fields.
We propose a confounded bandit problem where the noise becomes a latent confounder that affects both contexts and rewards.
We show that a dual instrumental variable regression can correctly identify the true reward function.
arXiv Detail & Related papers (2022-09-07T15:25:57Z)
- Risk and optimal policies in bandit experiments [0.0]
This paper provides a decision theoretic analysis of bandit experiments.
The bandit setting corresponds to a dynamic programming problem, but solving this directly is typically infeasible.
For normally distributed rewards, the minimal Bayes risk can be characterized as the solution to a nonlinear second-order partial differential equation.
arXiv Detail & Related papers (2021-12-13T00:41:19Z)
- Differentiable Bandit Exploration [38.81737411000074]
We learn such policies for an unknown distribution $\mathcal{P}$ using samples from $\mathcal{P}$.
Our approach is a form of meta-learning and exploits properties of $\mathcal{P}$ without making strong assumptions about its form.
arXiv Detail & Related papers (2020-02-17T05:07:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.