Differentiable Bandit Exploration
- URL: http://arxiv.org/abs/2002.06772v2
- Date: Tue, 9 Jun 2020 07:35:48 GMT
- Title: Differentiable Bandit Exploration
- Authors: Craig Boutilier, Chih-Wei Hsu, Branislav Kveton, Martin Mladenov,
Csaba Szepesvari, and Manzil Zaheer
- Abstract summary: We learn such policies for an unknown distribution $\mathcal{P}$ using samples from $\mathcal{P}$.
Our approach is a form of meta-learning and exploits properties of $\mathcal{P}$ without making strong assumptions about its form.
- Score: 38.81737411000074
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploration policies in Bayesian bandits maximize the average reward over
problem instances drawn from some distribution $\mathcal{P}$. In this work, we
learn such policies for an unknown distribution $\mathcal{P}$ using samples
from $\mathcal{P}$. Our approach is a form of meta-learning and exploits
properties of $\mathcal{P}$ without making strong assumptions about its form.
To do this, we parameterize our policies in a differentiable way and optimize
them by policy gradients, an approach that is general and easy to implement. We
derive effective gradient estimators and introduce novel variance reduction
techniques. We also analyze and experiment with various bandit policy classes,
including neural networks and a novel softmax policy. The latter has regret
guarantees and is a natural starting point for our optimization. Our
experiments show the versatility of our approach. We also observe that neural
network policies can learn implicit biases expressed only through the sampled
instances.
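As a rough illustration of the approach described in the abstract (a minimal sketch, not the authors' implementation): a softmax exploration policy with a single tunable inverse-temperature parameter can be optimized by a REINFORCE-style policy gradient, with a batch-mean baseline for variance reduction, over bandit instances drawn from a sampled instance distribution. The instance distribution, horizon, policy class, batch size, and learning rate below are illustrative assumptions rather than details taken from the paper.
```python
# Minimal sketch (not the authors' implementation): a softmax bandit policy with a
# single tunable inverse temperature, optimized by a REINFORCE-style policy gradient
# over bandit instances sampled from an assumed instance distribution P.
# The instance distribution, horizon, batch size, and learning rate are illustrative.
import numpy as np

rng = np.random.default_rng(0)
K, HORIZON = 5, 200                      # number of arms, rounds per instance

def sample_instance():
    """Assumed P: Bernoulli bandit whose arm means are drawn uniformly at random."""
    return rng.uniform(0.0, 1.0, size=K)

def run_episode(theta, means):
    """Play one sampled instance with a softmax policy over empirical means.
    Returns the total reward and the accumulated score d/d(theta) log pi(a_t | h_t)."""
    counts = np.ones(K)                  # one optimistic pseudo-pull per arm
    sums = np.ones(K)
    total_reward, score = 0.0, 0.0
    for _ in range(HORIZON):
        mu_hat = sums / counts
        logits = theta * mu_hat
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = rng.choice(K, p=probs)
        # d/d(theta) log softmax_a(theta * mu_hat) = mu_hat[a] - E_{probs}[mu_hat]
        score += mu_hat[a] - probs @ mu_hat
        r = float(rng.random() < means[a])
        counts[a] += 1.0
        sums[a] += r
        total_reward += r
    return total_reward, score

def policy_gradient_step(theta, batch_size=32, lr=0.01):
    """One REINFORCE update; the batch-mean baseline is a simple variance reducer."""
    episodes = [run_episode(theta, sample_instance()) for _ in range(batch_size)]
    returns = np.array([ret for ret, _ in episodes])
    scores = np.array([sc for _, sc in episodes])
    baseline = returns.mean()            # slightly biased but standard in practice
    grad = np.mean((returns - baseline) * scores)
    return theta + lr * grad

theta = 1.0                              # initial inverse temperature
for _ in range(50):
    theta = policy_gradient_step(theta)
print("tuned inverse temperature:", theta)
```
In the paper itself, richer differentiable policy classes (including neural networks and a softmax policy with regret guarantees) and more refined gradient estimators and variance-reduction techniques take the place of this single-parameter example.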
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
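As a generic illustration of that idea (a sketch under simplified one-step assumptions, not the estimator proposed in the cited paper): samples logged under a behavioral policy can be re-weighted by the likelihood ratio between the target policy and the behavioral policy inside a score-function gradient estimate.
```python
# Generic sketch of an importance-weighted policy-gradient estimate for a
# one-step (bandit-like) problem; the names and the softmax parameterization
# are illustrative assumptions, not the construction from the cited paper.
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def is_policy_gradient(theta, actions, rewards, behavior_probs):
    """Estimate d/d(theta) E_{a ~ pi_theta}[r(a)] from data logged under a
    behavioral policy, where pi_theta = softmax(theta) over K arms.
    actions, rewards, behavior_probs: arrays of logged (a, r, beta(a)) triples."""
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for a, r, b in zip(actions, rewards, behavior_probs):
        w = probs[a] / b                 # importance weight pi_theta(a) / beta(a)
        score = -probs.copy()            # d/d(theta) log pi_theta(a) = e_a - probs
        score[a] += 1.0
        grad += w * r * score
    return grad / len(actions)

# Example: re-weighting three logged pulls collected under a uniform behavioral policy.
theta = np.zeros(3)
g = is_policy_gradient(theta,
                       actions=np.array([0, 2, 2]),
                       rewards=np.array([1.0, 0.0, 1.0]),
                       behavior_probs=np.array([1/3, 1/3, 1/3]))
print(g)
```
Choosing the behavioral policy so that such weighted estimates have low variance is the question the paper above addresses.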
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches for dealing with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Theoretical guarantees on the best-of-n alignment policy [110.21094183592358]
We show that the KL divergence between the best-of-$n$ policy and the base policy is equal to $\log(n) - (n-1)/n$.
We propose a new estimator for the KL divergence and empirically show that it provides a tight approximation through a few examples.
arXiv Detail & Related papers (2024-01-03T18:39:13Z)
- Thompson Exploration with Best Challenger Rule in Best Arm Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z)
- Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}(\frac{\text{poly}(d)}{\varepsilon^3})$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z)
- Inverse Reinforcement Learning with the Average Reward Criterion [3.719493310637464]
We study the problem of Inverse Reinforcement Learning (IRL) with an average-reward criterion.
The goal is to recover an unknown policy and a reward function when the agent only has samples of states and actions from an experienced agent.
arXiv Detail & Related papers (2023-05-24T01:12:08Z)
- On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method [38.34416337932712]
Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE.
Yet the best known sample complexity result for such methods to find an $\epsilon$-optimal policy is $\mathcal{O}(\epsilon^{-3})$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization methods.
arXiv Detail & Related papers (2021-02-17T07:06:19Z)
- Meta-Learning Bandit Policies by Gradient Ascent [38.817374110000735]
Bandit policies are designed either to minimize regret in any problem instance or, in a Bayesian sense, to do so under an assumed prior distribution over environment parameters.
We study bandit problems that fall between these two extremes, where the learning agent has access to sampled bandit instances.
We propose the use of parameterized bandit policies that are differentiable and can be optimized using policy gradients.
arXiv Detail & Related papers (2020-06-09T07:45:41Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)