Guarded Policy Optimization with Imperfect Online Demonstrations
- URL: http://arxiv.org/abs/2303.01728v2
- Date: Mon, 24 Apr 2023 03:28:54 GMT
- Title: Guarded Policy Optimization with Imperfect Online Demonstrations
- Authors: Zhenghai Xue, Zhenghao Peng, Quanyi Li, Zhihan Liu, Bolei Zhou
- Abstract summary: The Teacher-Student Framework is a reinforcement learning setting in which a teacher agent guards the training of a student agent.
It is expensive or even impossible to obtain a well-performing teacher policy.
We develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance.
- Score: 32.22880650876471
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Teacher-Student Framework (TSF) is a reinforcement learning setting where
a teacher agent guards the training of a student agent by intervening and
providing online demonstrations. Assuming the teacher is optimal, it has
perfect timing and the capability to intervene in the learning process of the
student agent, providing a safety guarantee and exploration guidance.
Nevertheless, in many real-world settings it is expensive or even impossible to
obtain a well-performing teacher policy. In this work, we relax the assumption
of a well-performing teacher and develop a new method that can incorporate
arbitrary teacher policies with modest or inferior performance. We instantiate
an Off-Policy Reinforcement Learning algorithm, termed Teacher-Student Shared
Control (TS2C), which incorporates teacher intervention based on
trajectory-based value estimation. Theoretical analysis shows that the
proposed TS2C algorithm attains efficient exploration and a substantial safety
guarantee regardless of the teacher's own performance. Experiments
on various continuous control tasks show that our method can exploit teacher
policies at different performance levels while maintaining a low training cost.
Moreover, the student policy surpasses the imperfect teacher policy, achieving
higher accumulated reward in held-out testing environments. Code is available
at https://metadriverse.github.io/TS2C.
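The abstract describes intervention driven by trajectory-based value estimation: the teacher takes over only when the student's intended action looks markedly worse under a value estimate. Below is a minimal Python sketch of that idea, not the authors' implementation; `student`, `teacher`, `value_fn`, and `eps` are hypothetical stand-ins.

```python
# Minimal sketch of value-based teacher intervention in the spirit of TS2C.
# Assumptions (not from the paper's code): policies map an observation to an
# action, and value_fn(obs, action) estimates the return of taking `action`
# at `obs` and following the teacher afterwards.

def guarded_step(obs, student, teacher, value_fn, eps=0.1):
    """Let the student act unless its action is worth noticeably less than
    the teacher's under the value estimate; then the teacher intervenes."""
    a_student = student(obs)
    a_teacher = teacher(obs)
    if value_fn(obs, a_student) < value_fn(obs, a_teacher) - eps:
        return a_teacher, True   # teacher takes over (online demonstration)
    return a_student, False      # student acts on its own
```

Transitions collected this way can be stored together with the intervention flag and consumed by an off-policy learner, which matches the abstract's framing of TS2C as an off-policy RL algorithm.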
Related papers
- Policy composition in reinforcement learning via multi-objective policy
optimization [44.23907077052036]
We show that teacher policies can help speed up learning, particularly in the absence of shaping rewards.
In the humanoid domain, we also equip agents with the ability to control the selection of teachers.
arXiv Detail & Related papers (2023-08-29T17:50:27Z) - TGRL: An Algorithm for Teacher Guided Reinforcement Learning [45.38447023752256]
It is common to train a policy to maximize a combination of reinforcement and teacher-student learning objectives.
We present a principled approach, along with an approximate implementation, for dynamically and automatically balancing when to follow the teacher and when to use rewards (a generic sketch of such balancing appears after this list).
arXiv Detail & Related papers (2023-07-06T17:58:40Z) - Toward Student-Oriented Teacher Network Training For Knowledge Distillation [40.55715466657349]
We propose SoTeacher, a teacher training method that incorporates Lipschitz regularization and consistency regularization into empirical risk minimization (ERM).
Experiments on benchmark datasets using various knowledge distillation algorithms and teacher-student pairs confirm that SoTeacher can improve student accuracy consistently.
arXiv Detail & Related papers (2022-06-14T07:51:25Z) - Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - Self-Training with Differentiable Teacher [80.62757989797095]
Self-training achieves enormous success in various semi-supervised and weakly-supervised learning tasks.
The method can be interpreted as a teacher-student framework, where the teacher generates pseudo-labels, and the student makes predictions.
We propose a differentiable self-training method that treats the teacher-student framework as a Stackelberg game.
arXiv Detail & Related papers (2021-09-15T02:06:13Z) - Learning to Teach with Student Feedback [67.41261090761834]
Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student.
IKD trains the teacher model to generate specific soft targets at each training step for a given student.
Joint optimization for both teacher and student is achieved by two iterative steps.
arXiv Detail & Related papers (2021-09-10T03:01:01Z) - Privacy-Preserving Teacher-Student Deep Reinforcement Learning [23.934121758649052]
We develop a private mechanism that protects the privacy of the teacher's training dataset.
We empirically show that the algorithm improves the student's learning in both convergence rate and utility.
arXiv Detail & Related papers (2021-02-18T20:15:09Z) - Dual Policy Distillation [58.43610940026261]
Policy distillation, which transfers a teacher policy to a student policy, has achieved great success in challenging tasks of deep reinforcement learning.
In this work, we introduce dual policy distillation (DPD), a student-student framework in which two learners operate in the same environment and explore it from different perspectives.
The key challenge in developing this dual learning framework is identifying beneficial knowledge from the peer learner in contemporary learning-based reinforcement learning algorithms.
arXiv Detail & Related papers (2020-06-07T06:49:47Z)
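As referenced in the TGRL entry above, a recurring pattern in these papers is weighing a reinforcement objective against a teacher-imitation objective. The following is a generic, hypothetical sketch of such dynamic balancing, not TGRL's actual algorithm; `rl_loss`, the logits, and the `update_alpha` schedule are all illustrative assumptions.

```python
import torch.nn.functional as F

# Generic sketch: combine an RL loss with a KL term pulling the student
# toward the teacher's action distribution, with a dynamically adjusted
# weight. Illustrative only, not the TGRL algorithm.

def combined_loss(rl_loss, student_logits, teacher_logits, alpha):
    imitation = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return rl_loss + alpha * imitation

def update_alpha(alpha, student_return, guided_return, lr=0.01):
    # Crude schedule (an assumption): trust the teacher more while following
    # it still yields higher returns than the student acting alone.
    return max(0.0, alpha + lr * (guided_return - student_return))
```

The design choice being illustrated: rather than fixing the imitation weight, it is adapted from an observable signal (here, the return gap), so a weak teacher's influence decays automatically as the student improves.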
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.