Guarded Policy Optimization with Imperfect Online Demonstrations
- URL: http://arxiv.org/abs/2303.01728v2
- Date: Mon, 24 Apr 2023 03:28:54 GMT
- Title: Guarded Policy Optimization with Imperfect Online Demonstrations
- Authors: Zhenghai Xue, Zhenghao Peng, Quanyi Li, Zhihan Liu, Bolei Zhou
- Abstract summary: The Teacher-Student Framework is a reinforcement learning setting in which a teacher agent guards the training of a student agent.
It is expensive or even impossible to obtain a well-performing teacher policy.
We develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance.
- Score: 32.22880650876471
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Teacher-Student Framework (TSF) is a reinforcement learning setting where
a teacher agent guards the training of a student agent by intervening and
providing online demonstrations. Assuming the teacher is optimal, it has
perfect timing and the capability to intervene in the learning process of the
student agent, providing a safety guarantee and exploration guidance.
Nevertheless, in many real-world settings it is expensive or even impossible to
obtain a well-performing teacher policy. In this work, we relax the assumption
of a well-performing teacher and develop a new method that can incorporate
arbitrary teacher policies with modest or inferior performance. We instantiate
an Off-Policy Reinforcement Learning algorithm, termed Teacher-Student Shared
Control (TS2C), which incorporates teacher intervention based on
trajectory-based value estimation. Theoretical analysis shows that the
proposed TS2C algorithm attains efficient exploration and a substantial safety
guarantee regardless of the teacher's own performance. Experiments
on various continuous control tasks show that our method can exploit teacher
policies at different performance levels while maintaining a low training cost.
Moreover, the student policy surpasses the imperfect teacher policy, achieving
higher accumulated reward in held-out testing environments. Code is available
at https://metadriverse.github.io/TS2C.
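The abstract describes intervention driven by trajectory-based value estimation: the teacher takes over only when the student's intended action looks markedly worse under a value estimate. Below is a minimal Python sketch of that idea, not the authors' implementation; `student`, `teacher`, `value_fn`, and `eps` are hypothetical stand-ins.

```python
# Minimal sketch of value-based teacher intervention in the spirit of TS2C.
# Assumptions (not from the paper's code): policies map an observation to an
# action, and value_fn(obs, action) estimates the return of taking `action`
# at `obs` and following the teacher afterwards.

def guarded_step(obs, student, teacher, value_fn, eps=0.1):
    """Let the student act unless its action is worth noticeably less than
    the teacher's under the value estimate; then the teacher intervenes."""
    a_student = student(obs)
    a_teacher = teacher(obs)
    if value_fn(obs, a_student) < value_fn(obs, a_teacher) - eps:
        return a_teacher, True   # teacher takes over (online demonstration)
    return a_student, False      # student acts on its own
```

Transitions collected this way can be stored together with the intervention flag and consumed by an off-policy learner, which matches the abstract's framing of TS2C as an off-policy RL algorithm.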
Related papers
- Policy composition in reinforcement learning via multi-objective policy
optimization [44.23907077052036]
We show that teacher policies can help speed up learning, particularly in the absence of shaping rewards.
In the humanoid domain, we also equip agents with the ability to control the selection of teachers.
arXiv Detail & Related papers (2023-08-29T17:50:27Z) - TGRL: An Algorithm for Teacher Guided Reinforcement Learning [45.38447023752256]
It is common to train a policy to maximize a combination of reinforcement and teacher-student learning objectives.
We present a principled approach, along with an approximate implementation, for dynamically and automatically balancing when to follow the teacher and when to use rewards (a generic sketch of such balancing appears after this list).
arXiv Detail & Related papers (2023-07-06T17:58:40Z) - Toward Student-Oriented Teacher Network Training For Knowledge Distillation [40.55715466657349]
We propose SoTeacher, a teacher training method that incorporates Lipschitz regularization and consistency regularization into empirical risk minimization (ERM).
Experiments on benchmark datasets using various knowledge distillation algorithms and teacher-student pairs confirm that SoTeacher can improve student accuracy consistently.
arXiv Detail & Related papers (2022-06-14T07:51:25Z) - Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - Self-Training with Differentiable Teacher [80.62757989797095]
Self-training achieves enormous success in various semi-supervised and weakly-supervised learning tasks.
The method can be interpreted as a teacher-student framework, where the teacher generates pseudo-labels, and the student makes predictions.
We propose a differentiable self-training method that treats the teacher-student framework as a Stackelberg game.
arXiv Detail & Related papers (2021-09-15T02:06:13Z) - Learning to Teach with Student Feedback [67.41261090761834]
Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student.
IKD trains the teacher model to generate specific soft targets at each training step for a given student.
Joint optimization for both teacher and student is achieved by two iterative steps.
arXiv Detail & Related papers (2021-09-10T03:01:01Z) - Privacy-Preserving Teacher-Student Deep Reinforcement Learning [23.934121758649052]
We develop a private mechanism that protects the privacy of the teacher's training dataset.
We empirically show that the algorithm improves the student's learning in both convergence rate and utility.
arXiv Detail & Related papers (2021-02-18T20:15:09Z) - Dual Policy Distillation [58.43610940026261]
Policy distillation, which transfers a teacher policy to a student policy, has achieved great success in challenging tasks of deep reinforcement learning.
In this work, we introduce dual policy distillation (DPD), a student-student framework in which two learners operate in the same environment and explore it from different perspectives.
The key challenge in developing this dual learning framework is identifying beneficial knowledge from the peer learner in contemporary learning-based reinforcement learning algorithms.
arXiv Detail & Related papers (2020-06-07T06:49:47Z)
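As referenced in the TGRL entry above, a recurring pattern in these papers is weighing a reinforcement objective against a teacher-imitation objective. The following is a generic, hypothetical sketch of such dynamic balancing, not TGRL's actual algorithm; `rl_loss`, the logits, and the `update_alpha` schedule are all illustrative assumptions.

```python
import torch.nn.functional as F

# Generic sketch: combine an RL loss with a KL term pulling the student
# toward the teacher's action distribution, with a dynamically adjusted
# weight. Illustrative only, not the TGRL algorithm.

def combined_loss(rl_loss, student_logits, teacher_logits, alpha):
    imitation = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return rl_loss + alpha * imitation

def update_alpha(alpha, student_return, guided_return, lr=0.01):
    # Crude schedule (an assumption): trust the teacher more while following
    # it still yields higher returns than the student acting alone.
    return max(0.0, alpha + lr * (guided_return - student_return))
```

The design choice being illustrated: rather than fixing the imitation weight, it is adapted from an observable signal (here, the return gap), so a weak teacher's influence decays automatically as the student improves.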
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.