ALP-KD: Attention-Based Layer Projection for Knowledge Distillation
- URL: http://arxiv.org/abs/2012.14022v1
- Date: Sun, 27 Dec 2020 22:30:13 GMT
- Title: ALP-KD: Attention-Based Layer Projection for Knowledge Distillation
- Authors: Peyman Passban, Yimeng Wu, Mehdi Rezagholizadeh, Qun Liu
- Abstract summary: Two neural networks, namely a teacher and a student, are coupled together during training.
The teacher network is supposed to be a trustworthy predictor, and the student tries to mimic its predictions.
In such a setting, distillation only happens for final predictions, whereas the student could also benefit from the teacher's supervision for internal components.
- Score: 30.896957367331137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is a training and compression strategy
in which two neural networks, namely a teacher and a student, are coupled
together during training. The teacher network is supposed to be a trustworthy
predictor, and the student tries to mimic its predictions. Usually, a student
with a lighter architecture is selected so that we can achieve compression and
still deliver high-quality results. In such a setting, distillation only happens for
final predictions, whereas the student could also benefit from the teacher's
supervision for internal components.
Motivated by this, we studied the problem of distillation for intermediate
layers. Since there might not be a one-to-one alignment between student and
teacher layers, existing techniques skip some teacher layers and only distill
from a subset of them. This shortcoming directly impacts quality, so we instead
propose a combinatorial technique which relies on attention. Our model fuses
teacher-side information and takes each layer's significance into
consideration, then performs distillation between combined teacher layers and
those of the student. Using our technique, we distilled a 12-layer BERT (Devlin
et al. 2019) into 6-, 4-, and 2-layer counterparts and evaluated them on GLUE
tasks (Wang et al. 2018). Experimental results show that our combinatorial
approach is able to outperform other existing techniques.
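To make the idea concrete, below is a minimal PyTorch sketch of attention-based layer fusion for intermediate-layer distillation in the spirit of the abstract. It is not the authors' implementation: the function name `alp_kd_layer_loss`, the mean-pooled layer summaries, the assumption of equal teacher/student hidden sizes, and the MSE objective are all illustrative choices.

```python
# Minimal sketch of attention-based layer fusion for intermediate-layer
# distillation, in the spirit of the abstract above. Not the authors' code:
# the function name, mean-pooled layer summaries, equal hidden sizes, and
# the MSE objective are illustrative assumptions.
import torch
import torch.nn.functional as F

def alp_kd_layer_loss(student_hiddens, teacher_hiddens):
    """student_hiddens: list of S tensors, each [batch, seq, dim]
    teacher_hiddens: list of T tensors, each [batch, seq, dim], T >= S."""
    S = torch.stack(student_hiddens, dim=1)   # [batch, S, seq, dim]
    T = torch.stack(teacher_hiddens, dim=1)   # [batch, T, seq, dim]

    # Summarize each layer by mean pooling over the sequence dimension.
    s_vec = S.mean(dim=2)                     # [batch, S, dim]
    t_vec = T.mean(dim=2)                     # [batch, T, dim]

    # Each student layer attends over *all* teacher layers, so no teacher
    # layer is skipped; the softmax weights reflect each layer's significance.
    scores = torch.einsum("bsd,btd->bst", s_vec, t_vec) / s_vec.size(-1) ** 0.5
    alpha = F.softmax(scores, dim=-1)         # [batch, S, T]

    # Fuse teacher layers per student layer, then regress the student onto
    # the fused representation.
    fused = torch.einsum("bst,btqd->bsqd", alpha, T)   # [batch, S, seq, dim]
    return F.mse_loss(S, fused)

# Usage (illustrative): add this term to the usual prediction-level loss, e.g.
#   loss = task_loss + logit_distillation_loss + alp_kd_layer_loss(s_h, t_h)
```

If the teacher and student hidden sizes differ, a learned linear projection on the student side would be needed before the dot products and the MSE term.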
Related papers
- Decoupled Knowledge with Ensemble Learning for Online Distillation [3.794605440322862]
Online knowledge distillation is a one-stage strategy that alleviates the requirement for a pre-trained teacher through mutual learning and collaborative learning.
Recent peer collaborative learning (PCL) integrates an online ensemble, collaboration among base networks, and a temporal mean teacher to construct effective knowledge.
Decoupled knowledge for online knowledge distillation is generated by an independent teacher, separate from the student.
arXiv Detail & Related papers (2023-12-18T14:08:59Z)
- Knowledge Distillation Layer that Lets the Student Decide [6.689381216751284]
We propose a learnable KD layer for the student that improves KD with two distinct abilities:
i) learning how to leverage the teacher's knowledge, enabling it to discard nuisance information, and ii) feeding the transferred knowledge forward to deeper layers.
arXiv Detail & Related papers (2023-09-06T09:05:03Z)
- Triplet Knowledge Distillation [73.39109022280878]
In knowledge distillation, the teacher is generally much larger than the student, so the teacher's solution is likely to be difficult for the student to learn.
To ease the mimicking difficulty, we introduce a triplet knowledge distillation mechanism named TriKD.
arXiv Detail & Related papers (2023-05-25T12:12:31Z)
- Supervision Complexity and its Role in Knowledge Distillation [65.07910515406209]
We study the generalization behavior of a distilled student within a framework that highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher's predictions, and the complexity of the teacher's predictions.
We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
arXiv Detail & Related papers (2023-01-28T16:34:47Z)
- Switchable Online Knowledge Distillation [68.2673580932132]
Online Knowledge Distillation (OKD) improves the involved models by reciprocally exploiting the difference between teacher and student.
We propose Switchable Online Knowledge Distillation (SwitOKD) to address how this teacher-student gap should be handled during training.
arXiv Detail & Related papers (2022-09-12T03:03:40Z)
- Knowledge Distillation with the Reused Teacher Classifier [31.22117343316628]
We show that a simple knowledge distillation technique is enough to significantly narrow down the teacher-student performance gap.
Our technique achieves state-of-the-art results at the modest cost of a slightly reduced compression ratio due to the added projector.
arXiv Detail & Related papers (2022-03-26T06:28:46Z)
- Learning to Teach with Student Feedback [67.41261090761834]
Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student.
IKD trains the teacher model to generate specific soft targets at each training step for a certain student.
Joint optimization for both teacher and student is achieved by two iterative steps.
arXiv Detail & Related papers (2021-09-10T03:01:01Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Distilling Knowledge via Intermediate Classifier Heads [0.5584060970507505]
Knowledge distillation is a transfer-learning approach to train a resource-limited student model with the guidance of a larger pre-trained teacher model.
We introduce knowledge distillation via intermediate heads to mitigate the impact of the capacity gap.
Our experiments on various teacher-student pairs and datasets have demonstrated that the proposed approach outperforms the canonical knowledge distillation approach.
arXiv Detail & Related papers (2021-02-28T12:52:52Z)
- Differentiable Feature Aggregation Search for Knowledge Distillation [47.94874193183427]
We introduce feature aggregation to imitate multi-teacher distillation in a single-teacher distillation framework.
DFA is a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search.
Experimental results show that DFA outperforms existing methods on CIFAR-100 and CINIC-10 datasets.
arXiv Detail & Related papers (2020-08-02T15:42:29Z)