Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates
- URL: http://arxiv.org/abs/2112.15025v1
- Date: Thu, 30 Dec 2021 12:20:46 GMT
- Title: Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates
- Authors: Safa Alver, Doina Precup
- Abstract summary: We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
- Score: 63.58053355357644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of learning a good set of policies, so that when
combined together, they can solve a wide variety of unseen reinforcement
learning tasks with no or very little new data. Specifically, we consider the
framework of generalized policy evaluation and improvement, in which the
rewards for all tasks of interest are assumed to be expressible as a linear
combination of a fixed set of features. We show theoretically that, under
certain assumptions, having access to a specific set of diverse policies, which
we call a set of independent policies, can allow for instantaneously achieving
high-level performance on all possible downstream tasks which are typically
more complex than the ones on which the agent was trained. Based on this
theoretical analysis, we propose a simple algorithm that iteratively constructs
this set of policies. In addition to empirically validating our theoretical
results, we compare our approach with recently proposed diverse policy set
construction methods and show that, while others fail, our approach is able to
build a behavior basis that enables instantaneous transfer to all possible
downstream tasks. We also show empirically that having access to a set of
independent policies can better bootstrap the learning process on downstream
tasks where the new reward function cannot be described as a linear combination
of the features. Finally, we demonstrate that this policy set can be useful in
a realistic lifelong reinforcement learning setting.
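The instantaneous-transfer claim rests on the standard successor-features machinery from the generalized policy evaluation and improvement (GPE/GPI) literature: if every reward of interest can be written as a linear combination of features, r(s, a) = phi(s, a) . w, then each stored policy's action-values on a new task w follow directly from its successor features, and GPI acts greedily over all stored policies at once. The sketch below illustrates only this evaluation-and-improvement step; the array shapes, the random stand-in successor features, and the function names are illustrative assumptions, not the paper's implementation, and the paper's iterative construction of the independent policy set is not reproduced here.

```python
import numpy as np

# Assumed setup (illustrative, not the paper's code): a small MDP with
# n_states states, n_actions actions, n_features reward features, and
# n_policies previously learned policies whose successor features
# psi[i, s, a] = E_{pi_i}[ sum_t gamma^t * phi(s_t, a_t) ] are already known.
n_states, n_actions, n_features, n_policies = 10, 4, 3, 5
rng = np.random.default_rng(0)
psi = rng.random((n_policies, n_states, n_actions, n_features))  # stand-in SFs

def gpe(psi, w):
    """Generalized policy evaluation on task w.

    With rewards assumed linear in the features, r = phi . w, every stored
    policy's action-values follow immediately: q_i(s, a) = psi_i(s, a) . w.
    """
    return psi @ w  # shape: (n_policies, n_states, n_actions)

def gpi_action(psi, w, state):
    """Generalized policy improvement: act greedily over all stored policies.

    pi(s) = argmax_a max_i psi_i(s, a) . w -- no new learning is needed,
    which is the sense in which transfer to a new task w is 'instantaneous'.
    """
    q = gpe(psi, w)                      # (n_policies, n_states, n_actions)
    return int(np.argmax(q[:, state].max(axis=0)))

# A new downstream task is specified only by its reward weights w.
w_new = np.array([0.2, -1.0, 0.5])
print(gpi_action(psi, w_new, state=3))
```

The paper's contribution concerns which policies to store so that this GPI step performs well on all linearly-expressible downstream tasks; the sketch above only shows how a stored set is reused once it exists.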
Related papers
- Iterative Batch Reinforcement Learning via Safe Diversified Model-based Policy Search [2.0072624123275533]
Batch reinforcement learning enables policy learning without direct interaction with the environment during training.
This approach is well-suited for high-risk and cost-intensive applications, such as industrial control.
We present an algorithmic methodology for iterative batch reinforcement learning based on ensemble-based model-based policy search.
arXiv Detail & Related papers (2024-11-14T11:10:36Z)
- Offline Imitation Learning from Multiple Baselines with Applications to Compiler Optimization [17.729842629392742]
We study a Reinforcement Learning problem in which we are given a set of trajectories collected with K baseline policies.
The goal is to learn a policy which performs as well as the best combination of baselines on the entire state space.
arXiv Detail & Related papers (2024-03-28T14:34:02Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates (a toy LCB sketch appears after this list).
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer [7.970144204429356]
We introduce an SF-based extension of the Optimistic Linear Support algorithm to learn a set of policies whose SFs form a convex coverage set.
We prove that policies in this set can be combined via generalized policy improvement to construct optimal behaviors for any new linearly-expressible tasks.
arXiv Detail & Related papers (2022-06-22T19:00:08Z)
- A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z)
- Goal-Conditioned Reinforcement Learning with Imagined Subgoals [89.67840168694259]
We propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks.
Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic.
We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
arXiv Detail & Related papers (2021-07-01T15:30:59Z)
- DisCo RL: Distribution-Conditioned Reinforcement Learning for General-Purpose Policies [116.12670064963625]
We develop an off-policy algorithm called distribution-conditioned reinforcement learning (DisCo RL) to efficiently learn contextual policies.
We evaluate DisCo RL on a variety of robot manipulation tasks and find that it significantly outperforms prior methods on tasks that require generalization to new goal distributions.
arXiv Detail & Related papers (2021-04-23T16:51:58Z)
- SEERL: Sample Efficient Ensemble Reinforcement Learning [20.983016439055188]
We present a novel training and model selection framework for model-free reinforcement learning algorithms.
We show that learning and selecting an adequately diverse set of policies is required for a good ensemble.
Our framework is substantially more sample efficient and computationally inexpensive, and it exceeds state-of-the-art (SOTA) scores on Atari 2600 and MuJoCo.
arXiv Detail & Related papers (2020-01-15T10:12:00Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
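For the pessimistic policy learning entry above, the core idea of choosing by a lower confidence bound rather than a point estimate can be illustrated with a toy offline policy-selection example. The candidate policies, their estimated values, and the confidence widths below are made-up numbers; this is a generic LCB sketch under those assumptions, not the PPL algorithm from that paper.

```python
import numpy as np

# Illustrative offline estimates for three candidate policies: a point
# estimate of each policy's value and an uncertainty width that shrinks
# with the amount of supporting data (all numbers are made up).
value_estimate = np.array([0.80, 0.75, 0.90])  # point estimates
uncertainty = np.array([0.05, 0.02, 0.40])     # e.g. Bernstein-style widths

# Greedy selection on point estimates picks policy 2, whose high value
# rests on very little data.
greedy_choice = int(np.argmax(value_estimate))

# Pessimistic selection optimizes the lower confidence bound instead,
# penalizing policies whose value estimates are poorly supported.
lcb = value_estimate - uncertainty
pessimistic_choice = int(np.argmax(lcb))

print(greedy_choice, pessimistic_choice)  # 2 vs. 0
```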
This list is automatically generated from the titles and abstracts of the papers on this site.