Knowledge Distillation in Deep Learning and its Applications
- URL: http://arxiv.org/abs/2007.09029v1
- Date: Fri, 17 Jul 2020 14:43:52 GMT
- Title: Knowledge Distillation in Deep Learning and its Applications
- Authors: Abdolmaged Alkhulaifi, Fahad Alsahli, Irfan Ahmad
- Abstract summary: Deep learning models are relatively large, and it is hard to deploy such models on resource-limited devices.
One possible solution is knowledge distillation, whereby a smaller model (student model) is trained by utilizing the information from a larger model (teacher model).
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning-based models are relatively large, and it is hard to deploy
such models on resource-limited devices such as mobile phones and embedded
devices. One possible solution is knowledge distillation, whereby a smaller
model (the student model) is trained by utilizing information from a larger
model (the teacher model). In this paper, we present a survey of knowledge
distillation techniques applied to deep learning models. To compare the
performance of different techniques, we propose a new metric called the
distillation metric, which compares knowledge distillation algorithms based on
model sizes and accuracy scores. Based on the survey, some conclusions are
drawn and presented in this paper.
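To make the setup concrete, the following is a minimal sketch of the teacher-student objective the abstract describes, together with one plausible form of a size-versus-accuracy distillation metric. The KD loss follows the standard temperature-scaled soft-target formulation; the `distillation_metric` function, its `alpha` weighting, and the lower-is-better convention are illustrative assumptions, not the exact formula defined in the paper.

```python
# Sketch of standard knowledge distillation (teacher -> student) plus a
# hypothetical distillation metric; the metric is an assumption, not the
# paper's exact definition.
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, lam=0.5):
    """Temperature-scaled soft-target loss combined with hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # The KL term is scaled by T^2 to keep gradients comparable across temperatures.
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return lam * soft_loss + (1.0 - lam) * hard_loss

def distillation_metric(student_size, teacher_size, student_acc, teacher_acc, alpha=0.5):
    """Hypothetical score mixing relative size and relative accuracy (lower is better)."""
    size_ratio = student_size / teacher_size   # smaller student -> smaller ratio
    acc_ratio = student_acc / teacher_acc      # closer to 1 -> less accuracy lost
    return alpha * size_ratio + (1.0 - alpha) * (1.0 - acc_ratio)
```

Under this hypothetical weighting, a student at one quarter of the teacher's size that retains 97% of its accuracy would score 0.5 * 0.25 + 0.5 * 0.03 ≈ 0.14.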
Related papers
- LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
With proper training strategies, evaluated across different benchmarks, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
- Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation [56.053397775016755]
We propose a sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student.
To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students.
arXiv Detail & Related papers (2023-08-17T17:17:08Z)
- Multi-teacher knowledge distillation as an effective method for compressing ensembles of neural networks [0.0]
Large-scale deep models have achieved great success, but the enormous computational complexity and gigantic storage requirements make them difficult to implement in real-time applications.
We present a modified knowledge distillation framework that compresses an entire ensemble into the weight space of a single model.
We show that knowledge distillation can aggregate knowledge from multiple teachers into a single student model and, at the same computational complexity, yield a better-performing model than one trained in the standard manner.
arXiv Detail & Related papers (2023-02-14T17:40:36Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Extracting knowledge from features with multilevel abstraction [3.4443503349903124]
Self-knowledge distillation (SKD) aims at transferring the knowledge from a large teacher model to a small student model.
In this paper, we propose a novel SKD method that differs from mainstream approaches.
Experiments and ablation studies show its effectiveness and generalization across various kinds of tasks.
arXiv Detail & Related papers (2021-12-04T02:25:46Z)
- Distill on the Go: Online knowledge distillation in self-supervised learning [1.1470070927586016]
Recent works have shown that wider and deeper models benefit more from self-supervised learning than smaller models.
We propose Distill-on-the-Go (DoGo), a self-supervised learning paradigm using single-stage online knowledge distillation.
Our results show significant performance gains in the presence of noisy and limited labels.
arXiv Detail & Related papers (2021-04-20T09:59:23Z)
- Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning [93.18238573921629]
We study how an ensemble of deep learning models can improve test accuracy, and how the superior performance of the ensemble can be distilled into a single model.
We show that ensemble/knowledge distillation in deep learning works very differently from traditional learning theory.
We prove that self-distillation can also be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy.
arXiv Detail & Related papers (2020-12-17T18:34:45Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model throughout the distillation process.
Most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and differences in student model capability, learning differentially from teacher models can lead to better-performing distilled student models (a simplified multi-teacher weighting sketch appears after this list).
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
- Knowledge Distillation: A Survey [87.51063304509067]
Deep neural networks have been successful in both industry and academia, especially for computer vision tasks.
It is a challenge to deploy these cumbersome deep models on devices with limited resources.
Knowledge distillation effectively learns a small student model from a large teacher model.
arXiv Detail & Related papers (2020-06-09T21:47:17Z)
- Triplet Loss for Knowledge Distillation [2.683996597055128]
The purpose of knowledge distillation is to increase the similarity between the teacher model and the student model.
In metric learning, researchers develop methods to build models that increase the similarity of outputs for similar samples.
We argue that metric learning can clarify the differences between outputs, which could improve the performance of the student model.
arXiv Detail & Related papers (2020-04-17T08:48:29Z)
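Several of the entries above (the ensemble-compression and reinforced multi-teacher selection papers in particular) revolve around combining soft targets from multiple teachers. Below is a simplified sketch of that idea with fixed per-teacher weights; the reinforced-selection work learns the weighting per training example, so the `weights` argument here is a stand-in assumption rather than any paper's exact method.

```python
# Simplified multi-teacher distillation: a weighted mixture of the teachers'
# temperature-softened predictions supervises a single student. Fixed weights
# are an assumption; reinforced selection would instead choose/weight teachers
# per example.
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          weights=None, T=4.0, lam=0.5):
    n = len(teacher_logits_list)
    if weights is None:
        weights = [1.0 / n] * n  # equal weighting, as in most existing methods
    # Mix the teachers' softened probability distributions.
    mixed = sum(w * F.softmax(t / T, dim=-1)
                for w, t in zip(weights, teacher_logits_list))
    log_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(log_student, mixed, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return lam * soft_loss + (1.0 - lam) * hard_loss
```

Setting all weights equal recovers the plain ensemble-averaging baseline that several of the entries above compare against.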
This list is automatically generated from the titles and abstracts of the papers on this site.