Policy composition in reinforcement learning via multi-objective policy
optimization
- URL: http://arxiv.org/abs/2308.15470v2
- Date: Wed, 30 Aug 2023 17:59:52 GMT
- Title: Policy composition in reinforcement learning via multi-objective policy
optimization
- Authors: Shruti Mishra, Ankit Anand, Jordan Hoffmann, Nicolas Heess, Martin
Riedmiller, Abbas Abdolmaleki, Doina Precup
- Abstract summary: We show that teacher policies can help speed up learning, particularly in the absence of shaping rewards.
In the humanoid domain, we also equip agents with the ability to control the selection of teachers.
- Score: 44.23907077052036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We enable reinforcement learning agents to learn successful behavior policies
by utilizing relevant pre-existing teacher policies. The teacher policies are
introduced as objectives, in addition to the task objective, in a
multi-objective policy optimization setting. Using the Multi-Objective Maximum
a Posteriori Policy Optimization algorithm (Abdolmaleki et al. 2020), we show
that teacher policies can help speed up learning, particularly in the absence
of shaping rewards. In two domains with continuous observation and action
spaces, our agents successfully compose teacher policies in sequence and in
parallel, and are also able to further extend the policies of the teachers in
order to solve the task.
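For intuition, the MO-MPO-style improvement step described above can be sketched as follows: each objective (the task reward plus one per teacher) re-weights a set of sampled actions under its own temperature, and the parametric policy is then fit to all of the resulting targets. The NumPy sketch below is illustrative only, not the paper's implementation; the use of teacher log-likelihoods as per-objective scores and the fixed temperatures are assumptions (in MO-MPO each temperature is derived from a per-objective KL budget).
```python
import numpy as np

def composed_action_weights(q_task, teacher_logps, eta_task, eta_teachers):
    """Sketch of an MO-MPO-style weighting of N sampled actions at one state.

    q_task        : (N,) task Q-values for the sampled actions
    teacher_logps : list of (N,) arrays, log pi_teacher(a | s) for each teacher
    eta_*         : per-objective temperatures (assumed fixed here; MO-MPO
                    solves each one from a KL budget epsilon_k)
    """
    scores = [np.asarray(q_task)] + [np.asarray(lp) for lp in teacher_logps]
    etas = [eta_task] + list(eta_teachers)
    combined = np.zeros_like(scores[0], dtype=float)
    for s, eta in zip(scores, etas):
        w = np.exp((s - s.max()) / eta)  # objective k's improved action distribution
        combined += w / w.sum()          # the M-step fits the policy to every target
    return combined / combined.sum()

# Toy usage: 5 sampled actions, one task objective and two teachers.
rng = np.random.default_rng(0)
weights = composed_action_weights(
    q_task=rng.normal(size=5),
    teacher_logps=[rng.normal(size=5), rng.normal(size=5)],
    eta_task=0.5,
    eta_teachers=[1.0, 1.0],
)
print(weights)  # sums to 1; high-value, teacher-preferred actions dominate
```
Summing the per-objective target distributions mirrors MO-MPO's M-step, in which the parametric policy is trained by supervised learning against every objective's non-parametric improvement.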
Depending on the specified combination of task and teacher(s), the teacher(s) may
naturally act to limit the final performance of an agent. The extent to which
agents must adhere to the teacher policies is set by hyperparameters that govern
both the effect of the teachers on learning speed and the agent's eventual
performance on the task. In the humanoid domain (Tassa et al. 2018), we also
equip agents with the ability to control the selection of teachers. With this
ability, agents are able to meaningfully compose the teacher policies, achieving
a higher task reward on the walk task than agents without access to the teacher
policies. We illustrate the resemblance of the composed task policies to the
corresponding teacher policies through videos.
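The adherence hyperparameters referred to above correspond, in MO-MPO, to per-objective KL budgets epsilon_k: each budget fixes that objective's temperature through a convex dual, and a larger budget lets the objective (task or teacher) reshape the policy more. Below is a hedged sketch of that dual using a bounded scalar search; the optimizer choice is an assumption, and the paper's learned teacher-selection mechanism is not modeled here (one reading is that selection gates which teacher objectives are active at each state).
```python
import numpy as np
from scipy.optimize import minimize_scalar

def solve_temperature(scores, epsilon):
    """Solve the MPO-style temperature dual for one objective.

    scores  : (N,) per-action scores for this objective (task Q-values, or a
              teacher's log-likelihoods in the composition setting)
    epsilon : KL budget; larger values give this objective more influence
              over the improved policy
    """
    scores = np.asarray(scores, dtype=float)

    def dual(log_eta):
        eta = np.exp(log_eta)  # optimize in log space to keep eta > 0
        z = scores / eta
        # g(eta) = eta * epsilon + eta * log E[exp(score / eta)], stabilized
        return eta * epsilon + eta * (np.log(np.mean(np.exp(z - z.max()))) + z.max())

    res = minimize_scalar(dual, bounds=(-10.0, 5.0), method="bounded")
    return float(np.exp(res.x))

# A tight budget yields a large temperature, i.e. a near-uniform, cautious update.
rng = np.random.default_rng(1)
q = rng.normal(size=64)
print(solve_temperature(q, epsilon=0.01), solve_temperature(q, epsilon=0.5))
```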
Related papers
- Online Policy Distillation with Decision-Attention [23.807761525617384]
Policy Distillation (PD) has become an effective method for improving performance on deep reinforcement learning tasks.
We study knowledge transfer between different policies that learn diverse knowledge from the same environment.
We propose Online Policy Distillation (OPD) with Decision-Attention (DA).
arXiv Detail & Related papers (2024-06-08T14:40:53Z)
- Guarded Policy Optimization with Imperfect Online Demonstrations [32.22880650876471]
The Teacher-Student Framework is a reinforcement learning setting in which a teacher agent guards the training of a student agent.
It is expensive or even impossible to obtain a well-performing teacher policy.
We develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance.
arXiv Detail & Related papers (2023-03-03T06:24:04Z)
- Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow an agent to instantaneously achieve high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z)
- Goal-Conditioned Reinforcement Learning with Imagined Subgoals [89.67840168694259]
We propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks.
Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic.
We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
arXiv Detail & Related papers (2021-07-01T15:30:59Z)
- Lifetime policy reuse and the importance of task capacity [6.390849000337326]
Policy reuse and other multi-policy reinforcement learning techniques can learn multiple tasks but may generate many policies.
This paper presents two novel contributions: 1) Lifetime Policy Reuse, a model-agnostic policy reuse algorithm that avoids generating many policies, and 2) task-capacity-based pre-selection of the number of policies.
The results demonstrate the importance of Lifetime Policy Reuse and task-capacity-based pre-selection on an 18-task partially observable Pacman domain and a Cartpole domain of up to 125 tasks.
arXiv Detail & Related papers (2021-06-03T10:42:49Z)
- DisCo RL: Distribution-Conditioned Reinforcement Learning for General-Purpose Policies [116.12670064963625]
We develop an off-policy algorithm called distribution-conditioned reinforcement learning (DisCo RL) to efficiently learn contextual policies.
We evaluate DisCo RL on a variety of robot manipulation tasks and find that it significantly outperforms prior methods on tasks that require generalization to new goal distributions.
arXiv Detail & Related papers (2021-04-23T16:51:58Z)
- Privacy-Preserving Teacher-Student Deep Reinforcement Learning [23.934121758649052]
We develop a private mechanism that protects the privacy of the teacher's training dataset.
We empirically show that the algorithm improves the student's learning in terms of convergence rate and utility.
arXiv Detail & Related papers (2021-02-18T20:15:09Z)
- Towards Coordinated Robot Motions: End-to-End Learning of Motion Policies on Transform Trees [63.31965375413414]
We propose to solve multi-task problems through learning structured policies from human demonstrations.
Our structured policy is inspired by RMPflow, a framework for combining subtask policies on different spaces.
We derive an end-to-end learning objective function that is suitable for the multi-task problem.
arXiv Detail & Related papers (2020-12-24T22:46:22Z)
- Dual Policy Distillation [58.43610940026261]
Policy distillation, which transfers a teacher policy to a student policy, has achieved great success in challenging deep reinforcement learning tasks.
In this work, we introduce dual policy distillation (DPD), a student-student framework in which two learners operate in the same environment to explore it from different perspectives.
The key challenge in developing this dual learning framework is to identify the beneficial knowledge from the peer learner for contemporary learning-based reinforcement learning algorithms.
arXiv Detail & Related papers (2020-06-07T06:49:47Z)