Triplet Loss for Knowledge Distillation
- URL: http://arxiv.org/abs/2004.08116v1
- Date: Fri, 17 Apr 2020 08:48:29 GMT
- Title: Triplet Loss for Knowledge Distillation
- Authors: Hideki Oki, Motoshi Abe, Junichi Miyao, Takio Kurita
- Abstract summary: The purpose of knowledge distillation is to increase the similarity between the teacher model and the student model.
In metric learning, researchers develop methods that increase the similarity of a model's outputs for similar samples.
We argue that metric learning can sharpen the distinction between different outputs and thereby improve the performance of the student model.
- Score: 2.683996597055128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, deep learning has spread rapidly, and deeper, larger models have been proposed. However, the computational cost becomes enormous as these models grow. Various techniques for compressing models have been proposed to maintain performance while reducing computational cost. One such method is knowledge distillation (KD), a technique for transferring the knowledge of deep or ensemble models with many parameters (teacher model) to smaller, shallower models (student model). Since the purpose of knowledge distillation is to increase the similarity between the teacher model and the student model, we propose to introduce the concept of metric learning into knowledge distillation, making the student model closer to the teacher model by using pairs or triplets of training samples. In metric learning, researchers develop methods that increase the similarity of a model's outputs for similar samples; metric learning aims at reducing the distance between similar samples and increasing the distance between dissimilar ones. This ability to reduce the differences between similar outputs can be used in knowledge distillation to reduce the differences between the outputs of the teacher model and the student model. Since the teacher model's outputs for different objects are usually different, the student model needs to distinguish them. We argue that metric learning can sharpen the distinction between different outputs and thereby improve the performance of the student model. We have performed experiments comparing the proposed method with state-of-the-art knowledge distillation methods.
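To make the idea concrete, the sketch below (a rough illustration, not the authors' code) combines a standard distillation objective with a triplet term computed on the teacher and student outputs. The triplet construction (student output as anchor, teacher output for the same sample as positive, teacher output for another sample as negative), the temperature T, the margin, and the weights alpha and beta are all illustrative assumptions.

```python
# Hedged sketch of triplet loss for knowledge distillation (PyTorch).
# Assumptions not taken from the paper: the anchor is the student's softened
# output, the positive is the teacher's softened output for the same sample,
# and the negative is the teacher's output for a different sample in the batch.
import torch
import torch.nn.functional as F

def distillation_triplet_loss(student_logits, teacher_logits, labels,
                              T=4.0, margin=1.0, alpha=0.5, beta=0.5):
    """Cross-entropy + soft-target KD + an illustrative triplet term."""
    teacher_logits = teacher_logits.detach()  # no gradients through the teacher

    # Standard hard-label loss.
    ce = F.cross_entropy(student_logits, labels)

    # Hinton-style soft-target loss (temperature-scaled KL divergence).
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    # Triplet term: pull the student's output toward the teacher's output for
    # the same sample (positive) and push it away from the teacher's output
    # for another sample (negative; here a simple roll within the batch).
    anchor = F.softmax(student_logits / T, dim=1)
    positive = F.softmax(teacher_logits / T, dim=1)
    negative = positive.roll(shifts=1, dims=0)
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    triplet = F.relu(d_pos - d_neg + margin).mean()

    return ce + alpha * kd + beta * triplet
```

In practice the negative-sampling strategy (here a naive batch roll that ignores class labels) and the relative weighting of the three terms would need tuning; the paper's exact formulation may differ.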
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT, which overcomes these limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Distill on the Go: Online knowledge distillation in self-supervised learning [1.1470070927586016]
Recent works have shown that wider and deeper models benefit more from self-supervised learning than smaller models.
We propose Distill-on-the-Go (DoGo), a self-supervised learning paradigm using single-stage online knowledge distillation.
Our results show significant performance gain in the presence of noisy and limited labels.
arXiv Detail & Related papers (2021-04-20T09:59:23Z)
- Similarity Transfer for Knowledge Distillation [25.042405967561212]
Knowledge distillation is a popular paradigm for learning portable neural networks by transferring the knowledge from a large model into a smaller one.
We propose a novel method called similarity transfer for knowledge distillation (STKD), which aims to fully utilize the similarities between categories of multiple samples.
Experiments show that STKD substantially outperforms vanilla knowledge distillation and achieves superior accuracy over state-of-the-art knowledge distillation methods.
arXiv Detail & Related papers (2021-03-18T06:54:59Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to each teacher model throughout the distillation process, and most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
- Online Knowledge Distillation via Multi-branch Diversity Enhancement [15.523646047674717]
We propose a new distillation method to enhance the diversity among multiple student models.
We use a Feature Fusion Module (FFM), which improves the performance of the attention mechanism in the network.
We also use a Diversification (CD) loss function to strengthen the differences between the student models.
arXiv Detail & Related papers (2020-10-02T05:52:12Z)
- Autoregressive Knowledge Distillation through Imitation Learning [70.12862707908769]
We develop a compression technique for autoregressive models driven by an imitation learning perspective on knowledge distillation.
Our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation.
Student models trained with our method attain 1.4 to 4.8 BLEU/ROUGE points higher than those trained from scratch, while increasing inference speed by up to 14 times in comparison to the teacher model.
arXiv Detail & Related papers (2020-09-15T17:43:02Z)
- Knowledge Distillation in Deep Learning and its Applications [0.6875312133832078]
Deep learning models are relatively large, and it is hard to deploy such models on resource-limited devices.
One possible solution is knowledge distillation, whereby a smaller model (student model) is trained by utilizing the information from a larger model (teacher model).
arXiv Detail & Related papers (2020-07-17T14:43:52Z)