TGRL: An Algorithm for Teacher Guided Reinforcement Learning
- URL: http://arxiv.org/abs/2307.03186v2
- Date: Tue, 20 Feb 2024 04:12:37 GMT
- Title: TGRL: An Algorithm for Teacher Guided Reinforcement Learning
- Authors: Idan Shenfeld, Zhang-Wei Hong, Aviv Tamar, Pulkit Agrawal
- Abstract summary: It is common to train a policy to maximize a combination of reinforcement and teacher-student learning objectives.
We present a $\textit{principled}$ approach, along with an approximate implementation for $\textit{dynamically}$ and $\textit{automatically}$ balancing when to follow the teacher and when to use rewards.
- Score: 45.38447023752256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning from rewards (i.e., reinforcement learning or RL) and learning to
imitate a teacher (i.e., teacher-student learning) are two established
approaches for solving sequential decision-making problems. To combine the
benefits of these different forms of learning, it is common to train a policy
to maximize a combination of reinforcement and teacher-student learning
objectives. However, without a principled method to balance these objectives,
prior work relied on heuristics and problem-specific hyperparameter searches.
We present a $\textit{principled}$ approach, along
with an approximate implementation for $\textit{dynamically}$ and
$\textit{automatically}$ balancing when to follow the teacher and when to use
rewards. The main idea is to adjust the importance of teacher supervision by
comparing the agent's performance to the counterfactual scenario of the agent
learning without teacher supervision and only from rewards. If using teacher
supervision improves performance, the importance of teacher supervision is
increased and otherwise it is decreased. Our method, $\textit{Teacher Guided
Reinforcement Learning}$ (TGRL), outperforms strong baselines across diverse
domains without hyper-parameter tuning.
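The abstract describes the balancing mechanism only at a high level. Below is a minimal, hypothetical sketch of that idea, not the paper's implementation: it assumes the guided agent optimizes a reward objective plus an imitation term weighted by a coefficient, and that an auxiliary reward-only agent supplies the counterfactual return used for comparison. All names (`BalanceController`, `combined_loss`, `step_size`) and the multiplicative update rule are illustrative assumptions.

```python
# Sketch of the counterfactual-comparison balancing described in the TGRL
# abstract (illustrative only; not the authors' implementation).
from dataclasses import dataclass


@dataclass
class BalanceController:
    """Adjusts the weight of the teacher-supervision (imitation) term."""
    alpha: float = 1.0        # current weight on the teacher-imitation loss
    step_size: float = 0.05   # multiplicative adjustment per update (assumed)
    min_alpha: float = 0.0
    max_alpha: float = 10.0

    def update(self, guided_return: float, reward_only_return: float) -> float:
        """Compare the guided agent against its reward-only counterfactual.

        If teacher supervision is helping (guided return >= counterfactual
        return), increase its weight; otherwise decrease it.
        """
        if guided_return >= reward_only_return:
            self.alpha *= (1.0 + self.step_size)
        else:
            self.alpha *= (1.0 - self.step_size)
        self.alpha = min(max(self.alpha, self.min_alpha), self.max_alpha)
        return self.alpha


def combined_loss(rl_loss: float, imitation_loss: float, alpha: float) -> float:
    """Objective optimized by the guided agent: RL loss plus the
    teacher-imitation loss weighted by alpha."""
    return rl_loss + alpha * imitation_loss


if __name__ == "__main__":
    controller = BalanceController()
    # Toy returns: early on the teacher helps, later it holds the agent back.
    history = [(2.0, 1.0), (3.0, 2.5), (3.0, 3.5), (3.0, 4.0)]
    for guided, counterfactual in history:
        alpha = controller.update(guided, counterfactual)
        print(f"guided={guided:.1f} counterfactual={counterfactual:.1f} alpha={alpha:.3f}")
```

In a full training loop, `update` would be called periodically with recent average episodic returns of the guided agent and the reward-only agent, and the returned coefficient would reweight the imitation loss for the next round of policy updates.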
Related papers
- Dual Active Learning for Reinforcement Learning from Human Feedback [13.732678966515781]
Reinforcement learning from human feedback (RLHF) is widely applied to align large language models with human preferences.
Human feedback is costly and time-consuming, making it essential to collect high-quality conversation data for human teachers to label.
In this paper, we use offline reinforcement learning (RL) to formulate the alignment problem.
arXiv Detail & Related papers (2024-10-03T14:09:58Z)
- YODA: Teacher-Student Progressive Learning for Language Models [82.0172215948963]
This paper introduces YODA, a teacher-student progressive learning framework.
It emulates the teacher-student education process to improve the efficacy of model fine-tuning.
Experiments show that training LLaMA2 with data from YODA improves SFT with a significant performance gain.
arXiv Detail & Related papers (2024-01-28T14:32:15Z)
- Active teacher selection for reinforcement learning from human feedback [14.009227941725783]
Reinforcement learning from human feedback (RLHF) enables machine learning systems to learn objectives from human feedback.
We propose the Hidden Utility Bandit framework to model differences in teacher rationality, expertise, and costliness.
We develop a variety of solution algorithms and apply them to two real-world domains: paper recommendation systems and COVID-19 vaccine testing.
arXiv Detail & Related papers (2023-10-23T18:54:43Z)
- Guarded Policy Optimization with Imperfect Online Demonstrations [32.22880650876471]
The Teacher-Student Framework is a reinforcement learning setting where a teacher agent guards the training of a student agent.
It is expensive or even impossible to obtain a well-performing teacher policy.
We develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance.
arXiv Detail & Related papers (2023-03-03T06:24:04Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Iterative Teacher-Aware Learning [136.05341445369265]
In human pedagogy, teachers and students can interact adaptively to maximize communication efficiency.
We propose a gradient-optimization-based teacher-aware learner that can incorporate the teacher's cooperative intention into the likelihood function.
arXiv Detail & Related papers (2021-10-01T00:27:47Z)
- Distribution Matching for Machine Teaching [64.39292542263286]
Machine teaching is an inverse problem of machine learning that aims at steering the student learner towards its target hypothesis.
Previous studies on machine teaching focused on balancing the teaching risk and cost to find the best teaching examples.
This paper presents a distribution matching-based machine teaching strategy.
arXiv Detail & Related papers (2021-05-06T09:32:57Z)
- Active Imitation Learning from Multiple Non-Deterministic Teachers: Formulation, Challenges, and Algorithms [3.6702509833426613]
We formulate the problem of learning to imitate multiple, non-deterministic teachers with minimal interaction cost.
We first present a general framework that efficiently models and estimates such a distribution by learning continuous representations of the teacher policies.
Next, we develop Active Performance-Based Imitation Learning (APIL), an active learning algorithm for reducing the learner-teacher interaction cost.
arXiv Detail & Related papers (2020-06-14T03:06:27Z)
- Dual Policy Distillation [58.43610940026261]
Policy distillation, which transfers a teacher policy to a student policy, has achieved great success in challenging tasks of deep reinforcement learning.
In this work, we introduce dual policy distillation (DPD), a student-student framework in which two learners operate on the same environment to explore different perspectives of the environment.
The key challenge in developing this dual learning framework is to identify the beneficial knowledge from the peer learner for contemporary learning-based reinforcement learning algorithms.
arXiv Detail & Related papers (2020-06-07T06:49:47Z)