Controlling the Quality of Distillation in Response-Based Network
Compression
- URL: http://arxiv.org/abs/2112.10047v1
- Date: Sun, 19 Dec 2021 02:53:51 GMT
- Title: Controlling the Quality of Distillation in Response-Based Network
Compression
- Authors: Vibhas Vats and David Crandall
- Abstract summary: The performance of a compressed network is governed by the quality of distillation.
For a given teacher-student pair, the quality of distillation can be improved by finding the sweet spot between batch size and number of epochs while training the teacher.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of a distillation-based compressed network is governed by the
quality of distillation. The reason for the suboptimal distillation of a large
network (teacher) to a smaller network (student) is largely attributed to the
gap in the learning capacities of a given teacher-student pair. While it is hard
to distill all the knowledge of a teacher, the quality of distillation can be
controlled to a large extent to achieve better performance. Our experiments
show that the quality of distillation is largely governed by the quality of
teacher's response, which in turn is heavily affected by the presence of
similarity information in its response. A well-trained large capacity teacher
loses similarity information between classes in the process of learning
fine-grained discriminative properties for classification. The absence of
similarity information causes the distillation process to be reduced from
one-example-many-class learning to one-example-one-class learning, thereby
throttling the flow of diverse knowledge from the teacher. With the implicit
assumption that only the instilled knowledge can be distilled, instead of
focusing only on the knowledge distillation process, we scrutinize the knowledge
inculcation process. We argue that for a given teacher-student pair, the
quality of distillation can be improved by finding the sweet spot between batch
size and number of epochs while training the teacher. We discuss the steps to
find this sweet spot for better distillation. We also propose the distillation
hypothesis to distinguish whether the behavior of the distillation process stems
from genuine knowledge distillation or from a regularization effect. We conduct all our
experiments on three different datasets.
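
The mechanism the abstract appeals to is the standard response-based (logit) distillation objective, in which a temperature-softened teacher response spreads probability mass over classes the teacher finds similar, and it is exactly this spread that carries the similarity information discussed above. The PyTorch sketch below is a generic illustration of that objective, not the authors' code; `temperature` and `alpha` are conventional KD hyperparameters assumed here for illustration.

```python
import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits, labels,
                     temperature=4.0, alpha=0.5):
    """Generic response-based KD loss (Hinton-style soft targets).

    At higher temperatures the teacher's softened response distributes mass
    over similar classes, so the KL term supervises the student with
    one-example-many-class information; the cross-entropy term alone would
    provide only one-example-one-class supervision.
    """
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)

    # T^2 keeps the gradient scale of the soft term comparable to the hard term.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

Under the paper's argument, the teacher's own training should be tuned so that `teacher_logits` retain inter-class similarity rather than collapsing to near-one-hot responses. A naive way to look for the batch-size/epoch sweet spot is an outer search that scores each teacher by the accuracy of the student distilled from it; the sketch below is purely hypothetical (the `train_teacher`, `distill_student`, and `eval_student` helpers are assumed placeholders, and the paper describes its own steps rather than this brute-force search).

```python
def find_teacher_sweet_spot(batch_sizes, epoch_counts,
                            train_teacher, distill_student, eval_student):
    """Hypothetical brute-force search over teacher (batch size, epochs):
    score each setting by the validation accuracy of the distilled student."""
    best = None  # (accuracy, batch_size, epochs)
    for bs in batch_sizes:
        for ep in epoch_counts:
            teacher = train_teacher(batch_size=bs, epochs=ep)
            student = distill_student(teacher)
            acc = eval_student(student)
            if best is None or acc > best[0]:
                best = (acc, bs, ep)
    return best
```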
Related papers
- Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods.
Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions.
Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z)
- A Survey on Recent Teacher-student Learning Studies [0.0]
Knowledge distillation is a method of transferring the knowledge from a complex deep neural network (DNN) to a smaller and faster DNN.
Recent variants of knowledge distillation include teaching assistant distillation, curriculum distillation, mask distillation, and decoupling distillation.
arXiv Detail & Related papers (2023-04-10T14:30:28Z)
- Supervision Complexity and its Role in Knowledge Distillation [65.07910515406209]
We study the generalization behavior of a distilled student.
The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions.
We demonstrate efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
arXiv Detail & Related papers (2023-01-28T16:34:47Z)
- Revisiting Self-Distillation [50.29938732233947]
Self-distillation is the special case of knowledge distillation in which the teacher and the student share the same architecture.
Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data.
We conduct extensive experiments to show that self-distillation leads to flatter minima, thereby resulting in better generalization.
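
For reference, a common form of self-distillation trains successive generations of the same architecture, each supervised by the previous generation's softened responses together with the hard labels. The PyTorch sketch below illustrates that generic loop under assumed hyperparameters (`temperature`, `alpha`, `lr`) and an assumed `make_model` factory; it is not the experimental setup of the cited paper.

```python
import copy
import torch
import torch.nn.functional as F

def self_distill(make_model, loader, generations=3, epochs=1,
                 temperature=4.0, alpha=0.5, lr=0.1, device="cpu"):
    """Born-again-style self-distillation: each generation shares the
    architecture of the previous one and is trained on the previous
    generation's softened responses plus the hard labels. Illustrative only."""
    teacher = None
    for _ in range(generations):
        student = make_model().to(device)
        opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
        for _ in range(epochs):
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                logits = student(x)
                loss = F.cross_entropy(logits, y)
                if teacher is not None:
                    with torch.no_grad():
                        t_logits = teacher(x)
                    # Same response-based KD term as in the earlier sketch.
                    kd = F.kl_div(F.log_softmax(logits / temperature, dim=1),
                                  F.softmax(t_logits / temperature, dim=1),
                                  reduction="batchmean") * temperature ** 2
                    loss = alpha * kd + (1.0 - alpha) * loss
                opt.zero_grad()
                loss.backward()
                opt.step()
        # The trained student becomes the teacher for the next generation.
        teacher = copy.deepcopy(student).eval()
    return teacher
```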
arXiv Detail & Related papers (2022-06-17T00:18:51Z)
- Spot-adaptive Knowledge Distillation [39.23627955442595]
We propose a new distillation strategy, termed spot-adaptive KD (SAKD).
SAKD adaptively determines the distillation spots in the teacher network per sample, at every training iteration during the whole distillation period.
Experiments with 10 state-of-the-art distillers are conducted to demonstrate the effectiveness of SAKD.
arXiv Detail & Related papers (2022-05-05T02:21:32Z)
- Unified and Effective Ensemble Knowledge Distillation [92.67156911466397]
Ensemble knowledge distillation can extract knowledge from multiple teacher models and encode it into a single student model.
Many existing methods learn and distill the student model on labeled data only.
We propose a unified and effective ensemble knowledge distillation method that distills a single student model from an ensemble of teacher models on both labeled and unlabeled data.
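
The general form of ensemble response distillation can be sketched by averaging the teachers' softened responses and using the result as the student's soft target, applying the hard-label term only where labels exist. The snippet below is a generic illustration under those assumptions, not the specific unified method proposed in the cited paper.

```python
import torch
import torch.nn.functional as F

def ensemble_kd_loss(student_logits, teacher_logits_list, labels=None,
                     temperature=4.0, alpha=0.5):
    """Generic ensemble KD loss: the soft target is the mean of the teachers'
    temperature-softened responses; labels=None models an unlabeled batch,
    where only the soft-target term applies. Illustrative sketch only."""
    # Average the ensemble's softened probability distributions.
    soft_targets = torch.stack(
        [F.softmax(t / temperature, dim=1) for t in teacher_logits_list]
    ).mean(dim=0)
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                  soft_targets, reduction="batchmean") * temperature ** 2
    if labels is None:
        return kd  # unlabeled batch: soft targets only
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```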
arXiv Detail & Related papers (2022-04-01T16:15:39Z)
- Teacher's pet: understanding and mitigating biases in distillation [61.44867470297283]
Several works have shown that distillation significantly boosts the student's overall performance.
However, are these gains uniform across all data subgroups?
We show that distillation can harm performance on certain subgroups.
We present techniques which soften the teacher influence for subgroups where it is less reliable.
arXiv Detail & Related papers (2021-06-19T13:06:25Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Why distillation helps: a statistical perspective [69.90148901064747]
Knowledge distillation is a technique for improving the performance of a simple "student" model.
While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help?
We show how distillation complements existing negative mining techniques for extreme multiclass retrieval.
arXiv Detail & Related papers (2020-05-21T01:49:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.