Random Teachers are Good Teachers
- URL: http://arxiv.org/abs/2302.12091v2
- Date: Mon, 19 Jun 2023 12:49:34 GMT
- Title: Random Teachers are Good Teachers
- Authors: Felix Sarnthein, Gregor Bachmann, Sotiris Anagnostidis, Thomas Hofmann
- Abstract summary: We investigate the implicit regularization induced by teacher-student learning dynamics in self-distillation.
When distilling a student into such a random teacher, we observe a strong improvement of the distilled student over its teacher in terms of probing accuracy.
- Score: 19.74244993871716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we investigate the implicit regularization induced by
teacher-student learning dynamics in self-distillation. To isolate its effect,
we describe a simple experiment where we consider teachers at random
initialization instead of trained teachers. Surprisingly, when distilling a
student into such a random teacher, we observe that the resulting model and its
representations already possess very interesting characteristics; (1) we
observe a strong improvement of the distilled student over its teacher in terms
of probing accuracy. (2) The learned representations are data-dependent and
transferable between different tasks but deteriorate strongly if trained on
random inputs. (3) The student checkpoint contains sparse subnetworks,
so-called lottery tickets, and lies on the border of linear basins in the
supervised loss landscape. These observations have interesting consequences for
several important areas in machine learning: (1) Self-distillation can work
solely based on the implicit regularization present in the gradient dynamics
without relying on any dark knowledge, (2) self-supervised learning can learn
features even in the absence of data augmentation and (3) training dynamics
during the early phase of supervised training do not necessarily require label
information. Finally, we shed light on an intriguing local property of the loss
landscape: the process of feature learning is strongly amplified if the student
is initialized closely to the teacher. These results raise interesting
questions about the nature of the landscape that have remained unexplored so
far. Code is available at https://github.com/safelix/dinopl.
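The setup the abstract describes can be summarized in a short sketch: a frozen teacher kept at random initialization, a student initialized as a slightly perturbed copy of the teacher, and a label-free, augmentation-free distillation loss. The code below is a minimal illustration under these assumptions (plain PyTorch, a toy MLP encoder, and made-up hyperparameters such as the perturbation scale and temperature); it is not the authors' dinopl implementation, which should be consulted for the actual experiments.
Sketch (Python/PyTorch):
    # Minimal sketch of random-teacher distillation (illustrative only).
    import copy
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def make_encoder(in_dim=3 * 32 * 32, hidden=512, out_dim=256):
        # Toy MLP encoder; the paper uses standard vision backbones instead.
        return nn.Sequential(
            nn.Flatten(), nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, out_dim),
        )

    teacher = make_encoder()
    for p in teacher.parameters():
        p.requires_grad_(False)            # teacher stays at random initialization

    # Initialize the student close to the teacher (small perturbation); the paper
    # reports that this proximity strongly amplifies feature learning.
    student = copy.deepcopy(teacher)
    with torch.no_grad():
        for p in student.parameters():
            p.add_(1e-3 * torch.randn_like(p))
            p.requires_grad_(True)

    opt = torch.optim.Adam(student.parameters(), lr=1e-3)

    def distill_step(x, temperature=0.1):
        # No labels and no data augmentation: the student simply matches the
        # softened outputs of the frozen random teacher on the same input.
        with torch.no_grad():
            t_out = F.softmax(teacher(x) / temperature, dim=-1)
        s_out = F.log_softmax(student(x) / temperature, dim=-1)
        loss = -(t_out * s_out).sum(dim=-1).mean()   # cross-entropy to teacher
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    # Example usage on a random batch (real experiments use image datasets):
    x = torch.randn(128, 3, 32, 32)
    for step in range(10):
        distill_step(x)
A linear probe trained on the frozen student's features, compared against one trained on the teacher's features, would then quantify the probing-accuracy improvement the abstract refers to.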
Related papers
- Progressive distillation induces an implicit curriculum [44.528775476168654]
A better teacher does not always yield a better student; a common mitigation is to use additional supervision from several teachers.
One empirically validated variant of this principle is progressive distillation, where the student learns from successive intermediate checkpoints of the teacher.
Using sparse parity as a sandbox, we identify an implicit curriculum as one mechanism through which progressive distillation accelerates the student's learning.
arXiv Detail & Related papers (2024-10-07T19:49:24Z)
- Learn to Teach: Improve Sample Efficiency in Teacher-student Learning for Sim-to-Real Transfer [5.731477362725785]
We propose a sample efficient learning framework termed Learn to Teach (L2T) that recycles experience collected by the teacher agent.
We show that a single-loop algorithm can train both the teacher and student agents under both Reinforcement Learning and Inverse Reinforcement Learning contexts.
arXiv Detail & Related papers (2024-02-09T21:16:43Z)
- YODA: Teacher-Student Progressive Learning for Language Models [82.0172215948963]
This paper introduces YODA, a teacher-student progressive learning framework.
It emulates the teacher-student education process to improve the efficacy of model fine-tuning.
Experiments show that training LLaMA2 with data from YODA improves SFT with a significant performance gain.
arXiv Detail & Related papers (2024-01-28T14:32:15Z)
- L2T-DLN: Learning to Teach with Dynamic Loss Network [4.243592852049963]
In existing works, the teacher model merely determines the loss function based on the present state of the student model.
In this paper, we first formulate the loss adjustment as a temporal task by designing a teacher model with memory units.
Then, with a dynamic loss network, we can additionally use the states of the loss to assist the teacher learning in enhancing the interactions between the teacher and the student model.
arXiv Detail & Related papers (2023-10-30T07:21:40Z)
- UNIKD: UNcertainty-filtered Incremental Knowledge Distillation for Neural Implicit Representation [48.49860868061573]
Recent neural implicit representations (NIRs) have achieved great success in the tasks of 3D reconstruction and novel view synthesis.
They require the images of a scene from different camera views to be available for one-time training.
This is expensive especially for scenarios with large-scale scenes and limited data storage.
We design a student-teacher framework to mitigate the catastrophic forgetting problem.
arXiv Detail & Related papers (2022-12-21T11:43:20Z)
- Distantly-Supervised Named Entity Recognition with Adaptive Teacher Learning and Fine-grained Student Ensemble [56.705249154629264]
Self-training teacher-student frameworks are proposed to improve the robustness of NER models.
In this paper, we propose an adaptive teacher learning method comprised of two teacher-student networks.
Fine-grained student ensemble updates each fragment of the teacher model with a temporal moving average of the corresponding fragment of the student, which enhances consistent predictions on each model fragment against noise.
arXiv Detail & Related papers (2022-12-13T12:14:09Z)
- Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of requiring the teacher to work on the same task as the student, we borrow knowledge from a teacher trained on a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z)
- Know Thy Student: Interactive Learning with Gaussian Processes [11.641731210416102]
Our work proposes a simple diagnosis algorithm which uses Gaussian processes for inferring student-related information, before constructing a teaching dataset.
We study this in the offline reinforcement learning setting where the teacher must provide demonstrations to the student and avoid sending redundant trajectories.
Our experiments highlight the importance of diagnosing before teaching and demonstrate how students can learn more efficiently with the help of an interactive teacher.
arXiv Detail & Related papers (2022-04-26T04:43:57Z)
- Does Knowledge Distillation Really Work? [106.38447017262183]
We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood.
We identify difficulties in optimization as a key reason for why the student is unable to match the teacher.
arXiv Detail & Related papers (2021-06-10T17:44:02Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Student-Teacher Learning from Clean Inputs to Noisy Inputs [20.428469418957544]
Feature-based student-teacher learning is empirically successful in transferring the knowledge from a pre-trained teacher network to the student network.
We analyze this method theoretically using deep linear networks, and experimentally using nonlinear networks.
arXiv Detail & Related papers (2021-03-13T02:29:35Z)