Sparse Teachers Can Be Dense with Knowledge
- URL: http://arxiv.org/abs/2210.03923v1
- Date: Sat, 8 Oct 2022 05:25:34 GMT
- Title: Sparse Teachers Can Be Dense with Knowledge
- Authors: Yi Yang, Chen Zhang, Dawei Song
- Abstract summary: We propose a sparse teacher trick under the guidance of an overall knowledgable score for each teacher parameter.
The aim is to ensure that the expressive parameters are retained while the student-unfriendly ones are removed.
Experiments on the GLUE benchmark show that the proposed sparse teachers can be dense with knowledge and lead to students with compelling performance.
- Score: 35.83646432932867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in distilling pretrained language models have discovered
that, besides the expressiveness of knowledge, the student-friendliness should
be taken into consideration to realize a truly knowledgable teacher. Based on a
pilot study, we find that over-parameterized teachers can produce expressive
yet student-unfriendly knowledge, and are thus limited in overall
knowledgableness. To remove the parameters that result in
student-unfriendliness, we propose a sparse teacher trick under the guidance of
an overall knowledgable score for each teacher parameter. The knowledgable
score is essentially an interpolation of the expressiveness and
student-friendliness scores. The aim is to ensure that the expressive
parameters are retained while the student-unfriendly ones are removed.
Extensive experiments on the GLUE benchmark show that the proposed sparse
teachers can be dense with knowledge and lead to students with compelling
performance in comparison with a series of competitive baselines.
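The abstract stops at the level of the scoring rule, so the following is a minimal sketch of how such a knowledgable score could drive teacher pruning, assuming per-parameter expressiveness and student-friendliness scores have already been computed; the interpolation weight `alpha`, the `sparsity` ratio, and all function names are illustrative and not taken from the paper.
```python
import numpy as np

def knowledgable_scores(expressiveness, friendliness, alpha=0.5):
    """Interpolate per-parameter expressiveness and student-friendliness
    scores into a single knowledgable score (alpha is an assumed weight)."""
    return alpha * expressiveness + (1.0 - alpha) * friendliness

def prune_mask(scores, sparsity=0.3):
    """Keep the top (1 - sparsity) fraction of parameters by score;
    the rest are zeroed out to form the sparse teacher."""
    k = max(1, int(round((1.0 - sparsity) * scores.size)))
    threshold = np.sort(scores.ravel())[::-1][k - 1]
    return (scores >= threshold).astype(np.float32)

# Toy example: 8 teacher parameters with made-up scores.
expr = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
frnd = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6])
mask = prune_mask(knowledgable_scores(expr, frnd, alpha=0.5), sparsity=0.5)
print(mask)  # 1 = retained (expressive and student-friendly), 0 = removed
```
The retained parameters define the sparse teacher that the student is then distilled from; how the two underlying scores are estimated is left to the paper itself.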
Related papers
- Student-Informed Teacher Training [19.895253502371588]
Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images.
In this framework, a teacher is trained with privileged task information, while a student tries to predict the actions of the teacher with more limited observations.
We propose a framework for joint training of the teacher and student policies, encouraging the teacher to learn behaviors that can be imitated by the student.
arXiv Detail & Related papers (2024-12-12T10:34:26Z)
- Representational Alignment Supports Effective Machine Teaching [81.19197059407121]
GRADE is a new controlled experimental setting to study pedagogy and representational alignment.
We find that improved representational alignment with a student improves student learning outcomes.
However, this effect is moderated by the size and representational diversity of the class being taught.
arXiv Detail & Related papers (2024-06-06T17:48:24Z)
- Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization [84.86241161706911]
We show that teacher LLMs can indeed intervene on student reasoning to improve their performance.
We also demonstrate that in multi-turn interactions, teacher explanations generalize and students learn from the explained data.
We verify that misaligned teachers can lower student performance to random chance by intentionally misleading them.
arXiv Detail & Related papers (2023-06-15T17:27:20Z)
- Computationally Identifying Funneling and Focusing Questions in Classroom Discourse [24.279653100481863]
We propose the task of computationally detecting funneling and focusing questions in classroom discourse.
We release an annotated dataset of 2,348 teacher utterances labeled for funneling and focusing questions, or neither.
Our best model, a supervised RoBERTa model fine-tuned on our dataset, has a strong linear correlation of 0.76 with human expert labels and with positive educational outcomes.
arXiv Detail & Related papers (2022-07-08T01:28:29Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose the dynamic prior knowledge (DPK), which integrates part of teacher's features as the prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by using larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of requiring the teacher to work on the same task as the student, we borrow the knowledge from a teacher trained on a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z)
- Does Knowledge Distillation Really Work? [106.38447017262183]
We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood.
We identify difficulties in optimization as a key reason for why the student is unable to match the teacher.
arXiv Detail & Related papers (2021-06-10T17:44:02Z)
- Distilling Knowledge via Intermediate Classifier Heads [0.5584060970507505]
Knowledge distillation is a transfer-learning approach for training a resource-limited student model under the guidance of a larger pre-trained teacher model.
We introduce knowledge distillation via intermediate heads to mitigate the impact of the capacity gap; a rough sketch follows this entry.
Our experiments on various teacher-student pairs and datasets have demonstrated that the proposed approach outperforms the canonical knowledge distillation approach.
arXiv Detail & Related papers (2021-02-28T12:52:52Z)
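The entry above only names the mechanism, so here is a rough sketch of distilling a student against classifier heads mounted on intermediate teacher layers, under assumed tensor shapes and a standard soft-target loss; the temperature, the averaging of the terms, and all variable names are illustrative assumptions rather than the paper's exact recipe.
```python
import torch
import torch.nn.functional as F

# Hypothetical setup: two intermediate teacher feature maps plus the final
# teacher logits, and a student producing logits for the same 3-way task.
num_classes, batch = 3, 4
teacher_feats = [torch.randn(batch, 16), torch.randn(batch, 32)]
teacher_logits = torch.randn(batch, num_classes)
student_logits = torch.randn(batch, num_classes, requires_grad=True)

# Lightweight classifier heads mounted on the teacher's intermediate layers
# (task-trained in the paper's setting; randomly initialized here).
heads = [torch.nn.Linear(f.shape[1], num_classes) for f in teacher_feats]

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Standard soft-target distillation term at temperature T."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

# The student matches the final teacher output and every intermediate head,
# so the capacity gap is bridged through progressively simpler targets.
targets = [teacher_logits] + [head(f) for head, f in zip(heads, teacher_feats)]
loss = sum(kd_loss(student_logits, t.detach()) for t in targets) / len(targets)
loss.backward()
```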