TGRL: An Algorithm for Teacher Guided Reinforcement Learning
- URL: http://arxiv.org/abs/2307.03186v2
- Date: Tue, 20 Feb 2024 04:12:37 GMT
- Title: TGRL: An Algorithm for Teacher Guided Reinforcement Learning
- Authors: Idan Shenfeld, Zhang-Wei Hong, Aviv Tamar, Pulkit Agrawal
- Abstract summary: It is common to train a policy to maximize a combination of reinforcement and teacher-student learning objectives.
We present a $\textit{principled}$ approach, along with an approximate implementation for $\textit{dynamically}$ and $\textit{automatically}$ balancing when to follow the teacher and when to use rewards.
- Score: 45.38447023752256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning from rewards (i.e., reinforcement learning or RL) and learning to
imitate a teacher (i.e., teacher-student learning) are two established
approaches for solving sequential decision-making problems. To combine the
benefits of these different forms of learning, it is common to train a policy
to maximize a combination of reinforcement and teacher-student learning
objectives. However, without a principled method to balance these objectives,
prior work relied on heuristics and problem-specific hyperparameter searches.
We present a $\textit{principled}$ approach, along
with an approximate implementation for $\textit{dynamically}$ and
$\textit{automatically}$ balancing when to follow the teacher and when to use
rewards. The main idea is to adjust the importance of teacher supervision by
comparing the agent's performance to the counterfactual scenario of the agent
learning without teacher supervision and only from rewards. If using teacher
supervision improves performance, the importance of teacher supervision is
increased and otherwise it is decreased. Our method, $\textit{Teacher Guided
Reinforcement Learning}$ (TGRL), outperforms strong baselines across diverse
domains without hyper-parameter tuning.
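The abstract describes the balancing mechanism only at a high level. Below is a minimal, hypothetical sketch of that idea, not the paper's implementation: it assumes the guided agent optimizes a reward objective plus an imitation term weighted by a coefficient, and that an auxiliary reward-only agent supplies the counterfactual return used for comparison. All names (`BalanceController`, `combined_loss`, `step_size`) and the multiplicative update rule are illustrative assumptions.

```python
# Sketch of the counterfactual-comparison balancing described in the TGRL
# abstract (illustrative only; not the authors' implementation).
from dataclasses import dataclass


@dataclass
class BalanceController:
    """Adjusts the weight of the teacher-supervision (imitation) term."""
    alpha: float = 1.0        # current weight on the teacher-imitation loss
    step_size: float = 0.05   # multiplicative adjustment per update (assumed)
    min_alpha: float = 0.0
    max_alpha: float = 10.0

    def update(self, guided_return: float, reward_only_return: float) -> float:
        """Compare the guided agent against its reward-only counterfactual.

        If teacher supervision is helping (guided return >= counterfactual
        return), increase its weight; otherwise decrease it.
        """
        if guided_return >= reward_only_return:
            self.alpha *= (1.0 + self.step_size)
        else:
            self.alpha *= (1.0 - self.step_size)
        self.alpha = min(max(self.alpha, self.min_alpha), self.max_alpha)
        return self.alpha


def combined_loss(rl_loss: float, imitation_loss: float, alpha: float) -> float:
    """Objective optimized by the guided agent: RL loss plus the
    teacher-imitation loss weighted by alpha."""
    return rl_loss + alpha * imitation_loss


if __name__ == "__main__":
    controller = BalanceController()
    # Toy returns: early on the teacher helps, later it holds the agent back.
    history = [(2.0, 1.0), (3.0, 2.5), (3.0, 3.5), (3.0, 4.0)]
    for guided, counterfactual in history:
        alpha = controller.update(guided, counterfactual)
        print(f"guided={guided:.1f} counterfactual={counterfactual:.1f} alpha={alpha:.3f}")
```

In a full training loop, `update` would be called periodically with recent average episodic returns of the guided agent and the reward-only agent, and the returned coefficient would reweight the imitation loss for the next round of policy updates.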
Related papers
- Dual Active Learning for Reinforcement Learning from Human Feedback [13.732678966515781]
Reinforcement learning from human feedback (RLHF) is widely applied to align large language models with human preferences.
Human feedback is costly and time-consuming, making it essential to collect high-quality conversation data for human teachers to label.
In this paper, we use offline reinforcement learning (RL) to formulate the alignment problem.
arXiv Detail & Related papers (2024-10-03T14:09:58Z)
- YODA: Teacher-Student Progressive Learning for Language Models [82.0172215948963]
This paper introduces YODA, a teacher-student progressive learning framework.
It emulates the teacher-student education process to improve the efficacy of model fine-tuning.
Experiments show that training LLaMA2 with data from YODA improves SFT with a significant performance gain.
arXiv Detail & Related papers (2024-01-28T14:32:15Z)
- Active teacher selection for reinforcement learning from human feedback [14.009227941725783]
Reinforcement learning from human feedback (RLHF) enables machine learning systems to learn objectives from human feedback.
We propose the Hidden Utility Bandit framework to model differences in teacher rationality, expertise, and costliness.
We develop a variety of solution algorithms and apply them to two real-world domains: paper recommendation systems and COVID-19 vaccine testing.
arXiv Detail & Related papers (2023-10-23T18:54:43Z)
- Guarded Policy Optimization with Imperfect Online Demonstrations [32.22880650876471]
The Teacher-Student Framework is a reinforcement learning setting where a teacher agent guards the training of a student agent.
It is expensive or even impossible to obtain a well-performing teacher policy.
We develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance.
arXiv Detail & Related papers (2023-03-03T06:24:04Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Iterative Teacher-Aware Learning [136.05341445369265]
In human pedagogy, teachers and students can interact adaptively to maximize communication efficiency.
We propose a gradient-optimization-based teacher-aware learner that can incorporate the teacher's cooperative intention into the likelihood function.
arXiv Detail & Related papers (2021-10-01T00:27:47Z)
- Distribution Matching for Machine Teaching [64.39292542263286]
Machine teaching is an inverse problem of machine learning that aims at steering the student learner towards its target hypothesis.
Previous studies on machine teaching focused on balancing the teaching risk and cost to find the best teaching examples.
This paper presents a distribution matching-based machine teaching strategy.
arXiv Detail & Related papers (2021-05-06T09:32:57Z)
- Active Imitation Learning from Multiple Non-Deterministic Teachers: Formulation, Challenges, and Algorithms [3.6702509833426613]
We formulate the problem of learning to imitate multiple, non-deterministic teachers with minimal interaction cost.
We first present a general framework that efficiently models and estimates such a distribution by learning continuous representations of the teacher policies.
Next, we develop Active Performance-Based Imitation Learning (APIL), an active learning algorithm for reducing the learner-teacher interaction cost.
arXiv Detail & Related papers (2020-06-14T03:06:27Z)
- Dual Policy Distillation [58.43610940026261]
Policy distillation, which transfers a teacher policy to a student policy, has achieved great success in challenging tasks of deep reinforcement learning.
In this work, we introduce dual policy distillation (DPD), a student-student framework in which two learners operate on the same environment to explore different perspectives of the environment.
The key challenge in developing this dual learning framework is to identify the beneficial knowledge from the peer learner for contemporary learning-based reinforcement learning algorithms.
arXiv Detail & Related papers (2020-06-07T06:49:47Z)