Localization Distillation for Object Detection
- URL: http://arxiv.org/abs/2102.12252v2
- Date: Thu, 25 Feb 2021 07:23:17 GMT
- Title: Localization Distillation for Object Detection
- Authors: Zhaohui Zheng, Rongguang Ye, Ping Wang, Jun Wang, Dongwei Ren and Wangmeng Zuo
- Abstract summary: We propose localization distillation (LD) for object detection.
Our LD can be formulated as standard KD by adopting the general localization representation of bounding boxes.
We suggest a teacher assistant (TA) strategy to bridge the possible gap between the teacher and student models.
- Score: 79.78619050578997
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) has proven to be a powerful tool for learning compact models in deep learning, but it remains limited in distilling localization information for object detection. Existing KD methods for object detection mainly focus on mimicking deep features between the teacher and student models, which is not only restricted to specific model architectures but also unable to distill localization ambiguity. In this paper, we first propose localization distillation (LD) for object detection. In particular, our LD can be formulated as standard KD by adopting the general localization representation of bounding boxes. Our LD is very flexible and can distill localization ambiguity between teacher and student models of arbitrary architectures. Moreover, it is interesting to find that Self-LD, i.e., distilling the teacher model into itself, can further boost state-of-the-art performance. Second, we suggest a teacher assistant (TA) strategy to bridge the possible gap between the teacher and student models, so that distillation remains effective even when the selected teacher model is not optimal. On the PASCAL VOC and MS COCO benchmarks, our LD consistently improves the performance of student detectors and also notably boosts state-of-the-art detectors. Our source code and trained models are publicly available at https://github.com/HikariTJU/LD
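To make the idea concrete, below is a minimal PyTorch-style sketch of the LD objective under the assumption that both detectors use a GFocal-style head, where each bounding-box edge is predicted as a discrete distribution over n_bins candidate locations. The function name, temperature value, and tensor shapes are illustrative rather than the authors' released implementation (see the repository above for that).

```python
import torch
import torch.nn.functional as F

def ld_loss(student_edge_logits: torch.Tensor,
            teacher_edge_logits: torch.Tensor,
            temperature: float = 10.0) -> torch.Tensor:
    """Sketch of localization distillation (LD); not the official code.

    Both inputs have shape (num_boxes, 4, n_bins): each of the four box
    edges is represented as a discrete probability distribution over
    n_bins candidate offsets, as in GFocal-style heads. LD then applies
    ordinary KD (temperature-scaled KL divergence) to these distributions.
    """
    t = temperature
    log_p_student = F.log_softmax(student_edge_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_edge_logits / t, dim=-1)
    # kl_div sums over edges and bins and, with "batchmean", averages
    # over boxes; the t**2 factor keeps gradients on the usual scale.
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return kl * (t ** 2)

# Toy usage: 8 boxes whose edges are discretized into 17 bins each.
student_logits = torch.randn(8, 4, 17, requires_grad=True)
teacher_logits = torch.randn(8, 4, 17)
loss = ld_loss(student_logits, teacher_logits.detach())
loss.backward()
```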
Related papers
- Linear Projections of Teacher Embeddings for Few-Class Distillation [14.99228980898161]
Knowledge Distillation (KD) has emerged as a promising approach for transferring knowledge from a larger, more complex teacher model to a smaller student model.
We introduce a novel method for distilling knowledge from the teacher's model representations, which we term Learning Embedding Linear Projections (LELP).
Our experimental evaluation on large-scale NLP benchmarks like Amazon Reviews and Sentiment140 demonstrates that LELP is consistently competitive with, and typically superior to, existing state-of-the-art distillation algorithms for binary and few-class problems.
arXiv Detail & Related papers (2024-09-30T16:07:34Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods across various model architectures and sizes, while reducing training time by up to a factor of four.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- MiniLLM: Knowledge Distillation of Large Language Models [112.93051247165089]
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs).
We propose a KD approach that distills LLMs into smaller language models.
Our method scales across different model families ranging from 120M to 13B parameters.
arXiv Detail & Related papers (2023-06-14T14:44:03Z)
- Localization Distillation for Object Detection [134.12664548771534]
Previous knowledge distillation (KD) methods for object detection mostly focus on feature imitation instead of mimicking the classification logits.
We present a novel localization distillation (LD) method which can efficiently transfer the localization knowledge from the teacher to the student.
We show that logit mimicking can outperform feature imitation, and that the absence of localization distillation is a key reason why logit mimicking has underperformed for years.
arXiv Detail & Related papers (2022-04-12T17:14:34Z)
- Contrastive Neighborhood Alignment [81.65103777329874]
We present Contrastive Neighborhood Alignment (CNA), a manifold learning approach to maintain the topology of learned features.
The target model aims to mimic the local structure of the source representation space using a contrastive loss.
CNA is illustrated in three scenarios: manifold learning, where the model maintains the local topology of the original data in a dimension-reduced space; model distillation, where a small student model is trained to mimic a larger teacher; and legacy model update, where an older model is replaced by a more powerful one.
arXiv Detail & Related papers (2022-01-06T04:58:31Z)
- Causal Distillation for Language Models [23.68246698789134]
We show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher.
Compared with standard distillation of BERT, distillation via interchange intervention training (IIT) results in lower perplexity on Wikipedia.
arXiv Detail & Related papers (2021-12-05T08:13:09Z)
- Oracle Teacher: Leveraging Target Information for Better Knowledge Distillation of CTC Models [10.941519846908697]
We introduce a new type of teacher model for connectionist temporal classification (CTC)-based sequence models, namely Oracle Teacher.
Since the Oracle Teacher learns a more accurate CTC alignment by referring to the target information, it can provide the student with more optimal guidance.
Based on a many-to-one mapping property of the CTC algorithm, we present a training strategy that can effectively prevent the trivial solution.
arXiv Detail & Related papers (2021-11-05T14:14:05Z)
- Self-Feature Regularization: Self-Feature Distillation Without Teacher Models [0.0]
Self-Feature Regularization (SFR) is proposed, which uses features in the deep layers to supervise feature learning in the shallow layers.
We first use a generalization-l2 loss to match local features and a many-to-one approach to distill more intensively along the channel dimension.
arXiv Detail & Related papers (2021-03-12T15:29:00Z)
- Distilling Object Detectors with Task Adaptive Regularization [97.52935611385179]
Current state-of-the-art object detectors come at the expense of high computational costs and are hard to deploy on low-end devices.
Knowledge distillation, which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the promising solutions for model miniaturization.
arXiv Detail & Related papers (2020-06-23T15:58:22Z)
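Several of the entries above contrast logit mimicking with feature imitation. For reference, a generic FitNets-style feature-imitation loss looks roughly like the sketch below (a 1x1 adaptation conv plus an L2 match); the module and tensor names are chosen purely for illustration and are not taken from any of the listed papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureImitation(nn.Module):
    """Generic FitNets-style feature imitation; not any specific paper's method.

    A 1x1 conv adapts the student feature map to the teacher's channel
    width, and an L2 (MSE) loss pulls the adapted student features
    towards the (fixed) teacher features.
    """

    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        self.adapt = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        # Teacher features are treated as constant targets (no gradient).
        return F.mse_loss(self.adapt(student_feat), teacher_feat.detach())

# Toy usage: FPN-level features, 128 student channels vs 256 teacher channels.
imitation = FeatureImitation(student_channels=128, teacher_channels=256)
s_feat = torch.randn(2, 128, 32, 32, requires_grad=True)
t_feat = torch.randn(2, 256, 32, 32)
loss = imitation(s_feat, t_feat)
loss.backward()
```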