Multi-Task Off-Policy Learning from Bandit Feedback
- URL: http://arxiv.org/abs/2212.04720v1
- Date: Fri, 9 Dec 2022 08:26:27 GMT
- Title: Multi-Task Off-Policy Learning from Bandit Feedback
- Authors: Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, and Mohammad Ghavamzadeh
- Abstract summary: We propose a hierarchical off-policy optimization algorithm (HierOPO), which estimates the parameters of the hierarchical model and then acts pessimistically with respect to them.
We prove per-task bounds on the suboptimality of the learned policies, which show a clear improvement over not using the hierarchical model.
Our theoretical and empirical results show a clear advantage of using the hierarchy over solving each task independently.
- Score: 54.96011624223482
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many practical applications, such as recommender systems and learning to
rank, involve solving multiple similar tasks. One example is learning
recommendation policies for users with similar movie preferences, where the
users may still rank the individual movies slightly differently. Such tasks can
be organized in a hierarchy, where similar tasks are related through a shared
structure. In this work, we formulate this problem as a contextual off-policy
optimization in a hierarchical graphical model from logged bandit feedback. To
solve the problem, we propose a hierarchical off-policy optimization algorithm
(HierOPO), which estimates the parameters of the hierarchical model and then
acts pessimistically with respect to them. We instantiate HierOPO in linear
Gaussian models, for which we also provide an efficient implementation and
analysis. We prove per-task bounds on the suboptimality of the learned
policies, which show a clear improvement over not using the hierarchical model.
We also evaluate the policies empirically. Our theoretical and empirical
results show a clear advantage of using the hierarchy over solving each task
independently.
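To make the linear-Gaussian instantiation concrete, here is a minimal sketch (not the authors' code) of the two steps the abstract describes: estimate posteriors of the shared hyper-parameter and of each task parameter from logged data, then act pessimistically via a lower confidence bound. The priors (mu0, Sigma0, Sigma_q), the noise variance sigma2, and the pessimism constant c are illustrative assumptions; the paper's exact posterior updates and confidence widths may differ.

```python
# A minimal sketch of hierarchical off-policy optimization with pessimism in a
# linear-Gaussian model, in the spirit of the abstract above. This is NOT the
# authors' implementation: priors, noise variance, and the pessimism constant
# are illustrative assumptions.
import numpy as np

def hier_posteriors(tasks, mu0, Sigma0, Sigma_q, sigma2):
    """Posteriors of the shared hyper-parameter and of each task parameter.

    tasks: list of (X_t, y_t), the logged feature matrix and rewards of task t.
    Model: mu ~ N(mu0, Sigma0); theta_t | mu ~ N(mu, Sigma_q);
           y_t = X_t theta_t + N(0, sigma2 I).
    """
    # Hyper-posterior: marginalize theta_t, so y_t | mu ~ N(X_t mu, M_t)
    # with M_t = X_t Sigma_q X_t^T + sigma2 I, and accumulate precision.
    H = np.linalg.inv(Sigma0)
    h = H @ mu0
    for X, y in tasks:
        M_inv = np.linalg.inv(X @ Sigma_q @ X.T + sigma2 * np.eye(len(y)))
        H += X.T @ M_inv @ X
        h += X.T @ M_inv @ y
    Sigma_mu = np.linalg.inv(H)
    mu_hat = Sigma_mu @ h

    # Per-task posterior: condition on the hyper-posterior and the task's own
    # data, then propagate the hyper-posterior uncertainty into the covariance.
    Sq_inv = np.linalg.inv(Sigma_q)
    posts = []
    for X, y in tasks:
        V = np.linalg.inv(Sq_inv + X.T @ X / sigma2)
        m = V @ (Sq_inv @ mu_hat + X.T @ y / sigma2)
        A = V @ Sq_inv
        posts.append((m, V + A @ Sigma_mu @ A.T))
    return mu_hat, Sigma_mu, posts

def pessimistic_action(action_features, m, V, c=1.0):
    """Pick the action maximizing the lower confidence bound x^T m - c * ||x||_V."""
    widths = np.sqrt(np.einsum("ad,dk,ak->a", action_features, V, action_features))
    return int(np.argmax(action_features @ m - c * widths))
```

For a new context in task t, one would build the feature vectors of the candidate actions and call `pessimistic_action(features, *posts[t])`. Intuitively, the hyper-posterior tightens as more tasks contribute data, which is the source of the per-task improvement the abstract claims over solving each task independently.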
Related papers
- On the benefits of pixel-based hierarchical policies for task generalization [7.207480346660617]
Reinforcement learning practitioners often avoid hierarchical policies, especially in image-based observation spaces.
We analyze the benefits of hierarchy through simulated multi-task robotic control experiments from pixels.
arXiv Detail & Related papers (2024-07-27T01:26:26Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned during training, but only their deterministic version is deployed.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Offline Imitation Learning from Multiple Baselines with Applications to Compiler Optimization [17.729842629392742]
We study a Reinforcement Learning problem in which we are given a set of trajectories collected with K baseline policies.
The goal is to learn a policy which performs as well as the best combination of baselines on the entire state space.
arXiv Detail & Related papers (2024-03-28T14:34:02Z)
- Planning with a Learned Policy Basis to Optimally Solve Complex Tasks [26.621462241759133]
We propose to use successor features to learn a policy basis so that each (sub)policy in it solves a well-defined subproblem.
In a task described by a finite state automaton (FSA) that involves the same set of subproblems, the combination of these (sub)policies can then be used to generate an optimal solution without additional learning.
arXiv Detail & Related papers (2024-03-22T15:51:39Z)
- Pessimistic Off-Policy Optimization for Learning to Rank [13.733459243449634]
Off-policy learning is a framework for optimizing policies without deploying them.
In recommender systems, this is especially challenging due to the imbalance in logged data.
We study pessimistic off-policy optimization for learning to rank.
arXiv Detail & Related papers (2022-06-06T12:58:28Z)
- Deep Hierarchy in Bandits [51.22833900944146]
Mean rewards of actions are often correlated.
To maximize statistical efficiency, it is important to leverage these correlations when learning.
We formulate a bandit variant of this problem where the correlations of mean action rewards are represented by a hierarchical Bayesian model.
arXiv Detail & Related papers (2022-02-03T08:15:53Z)
- Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z)
- How Fine-Tuning Allows for Effective Meta-Learning [50.17896588738377]
We present a theoretical framework for analyzing representations derived from a MAML-like algorithm.
We provide risk bounds on the best predictor found by fine-tuning via gradient descent, demonstrating that the algorithm can provably leverage the shared structure.
This separation result underscores the benefit of fine-tuning-based methods, such as MAML, over methods with "frozen representation" objectives in few-shot learning.
arXiv Detail & Related papers (2021-05-05T17:56:00Z)
- Hierarchical Variational Imitation Learning of Control Programs [131.7671843857375]
We propose a variational inference method for imitation learning of a control policy represented by parametrized hierarchical procedures (PHP).
Our method discovers the hierarchical structure in a dataset of observation-action traces of teacher demonstrations, by learning an approximate posterior distribution over the latent sequence of procedure calls and terminations.
We demonstrate a novel benefit of variational inference in the context of hierarchical imitation learning: in decomposing the policy into simpler procedures, inference can leverage acausal information that is unused by other methods.
arXiv Detail & Related papers (2019-12-29T08:57:02Z)