Knowledge distillation: A good teacher is patient and consistent
- URL: http://arxiv.org/abs/2106.05237v1
- Date: Wed, 9 Jun 2021 17:20:40 GMT
- Title: Knowledge distillation: A good teacher is patient and consistent
- Authors: Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, Alexander Kolesnikov
- Abstract summary: There is a growing discrepancy in computer vision between large-scale models that achieve state-of-the-art performance and models that are affordable in practical applications.
We identify certain implicit design choices, which may drastically affect the effectiveness of distillation.
We obtain a state-of-the-art ResNet-50 model for ImageNet, which achieves 82.8% top-1 accuracy.
- Score: 71.14922743774864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is a growing discrepancy in computer vision between large-scale models
that achieve state-of-the-art performance and models that are affordable in
practical applications. In this paper we address this issue and significantly
bridge the gap between these two types of models. Throughout our empirical
investigation we do not aim to necessarily propose a new method, but strive to
identify a robust and effective recipe for making state-of-the-art large scale
models affordable in practice. We demonstrate that, when performed correctly,
knowledge distillation can be a powerful tool for reducing the size of large
models without compromising their performance. In particular, we uncover that
there are certain implicit design choices, which may drastically affect the
effectiveness of distillation. Our key contribution is the explicit
identification of these design choices, which were not previously articulated
in the literature. We back up our findings by a comprehensive empirical study,
demonstrate compelling results on a wide range of vision datasets and, in
particular, obtain a state-of-the-art ResNet-50 model for ImageNet, which
achieves 82.8% top-1 accuracy.
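For readers who want to see the mechanics behind the "patient and consistent" recipe, the sketch below shows the kind of distillation objective the abstract refers to: the teacher and the student receive the same augmented view of each image, and the student is trained to match the teacher's softened predictions with a KL-divergence loss. This is a minimal illustration assuming PyTorch-style teacher and student modules; the function and argument names are placeholders, not the paper's code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher, student, images, temperature=1.0):
    """One distillation step in the 'consistent teaching' spirit:
    both models see the *same* augmented batch, and the student is
    trained to match the teacher's softened output distribution."""
    with torch.no_grad():
        teacher_logits = teacher(images)      # teacher is frozen: no gradients
    student_logits = student(images)          # identical inputs -> consistent targets

    t = temperature
    loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),  # student log-probabilities
        F.softmax(teacher_logits / t, dim=-1),      # teacher probabilities
        reduction="batchmean",
    ) * (t * t)                                     # usual temperature rescaling
    return loss
```

The "patient" part of the recipe is then simply a matter of optimizing such an objective over unusually long training schedules.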
Related papers
- Distill-then-prune: An Efficient Compression Framework for Real-time Stereo Matching Network on Edge Devices [5.696239274365031]
We propose a novel strategy by incorporating knowledge distillation and model pruning to overcome the inherent trade-off between speed and accuracy.
We obtained a model that maintains real-time performance while delivering high accuracy on edge devices.
arXiv Detail & Related papers (2024-05-20T06:03:55Z)
- On the Surprising Efficacy of Distillation as an Alternative to Pre-Training Small Models [7.062887337934677]
We propose that small models may not need to absorb the cost of pre-training to reap its benefits.
We observe that, when distilled on a task from a pre-trained model, a small model can achieve or surpass the performance it would attain if it had been pre-trained and then fine-tuned on that task.
arXiv Detail & Related papers (2024-04-04T07:38:11Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- The Quest of Finding the Antidote to Sparse Double Descent [1.336445018915526]
As the model's sparsity increases, the performance first worsens, then improves, and finally deteriorates.
Such non-monotonic behavior raises serious questions about the optimal model size for maintaining high performance.
We show that a simple $\ell_2$ regularization method can help mitigate this phenomenon, but it sacrifices the performance/sparsity trade-off.
arXiv Detail & Related papers (2023-08-31T09:56:40Z)
- Prototype-guided Cross-task Knowledge Distillation for Large-scale Models [103.04711721343278]
Cross-task knowledge distillation helps train a small student model to obtain competitive performance.
We propose a Prototype-guided Cross-task Knowledge Distillation (ProC-KD) approach to transfer the intrinsic local-level object knowledge of a large-scale teacher network to various task scenarios.
arXiv Detail & Related papers (2022-12-26T15:00:42Z)
- Self-attention Presents Low-dimensional Knowledge Graph Embeddings for Link Prediction [6.789370732159177]
Self-attention is the key to applying query-dependent projections to entities and relations.
Our model achieves performance comparable to or better than that of the three best recent state-of-the-art competitors.
arXiv Detail & Related papers (2021-12-20T16:11:01Z)
- When in Doubt, Summon the Titans: Efficient Inference with Large Models [80.2673230098021]
We propose a two-stage framework based on distillation that realizes the modelling benefits of large models.
We use the large teacher models to guide the lightweight student models to only make correct predictions on a subset of "easy" examples.
Our proposed use of distillation to only handle easy instances allows for a more aggressive trade-off in the student size, thereby reducing the amortized cost of inference (see the sketch after this list).
arXiv Detail & Related papers (2021-10-19T22:56:49Z)
- Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning [109.74041512359476]
We study a number of design decisions for the predictive model in visual MBRL algorithms.
We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance.
We show how this phenomenon is related to exploration and how some of the lower-scoring models on standard benchmarks will perform the same as the best-performing models when trained on the same training data.
arXiv Detail & Related papers (2020-12-08T18:03:21Z)
- Towards Practical Lipreading with Distilled and Efficient Models [57.41253104365274]
Lipreading has witnessed a lot of progress due to the resurgence of neural networks.
Recent works have placed emphasis on aspects such as improving performance by finding the optimal architecture or improving generalization.
There is still a significant gap between the current methodologies and the requirements for an effective deployment of lipreading in practical scenarios.
We propose a series of innovations that significantly bridge that gap: first, using self-distillation, we raise the state-of-the-art performance on LRW and LRW-1000 by a wide margin, to 88.5% and 46.6%, respectively.
arXiv Detail & Related papers (2020-07-13T16:56:27Z)
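As an illustration of the easy/hard routing idea summarized for "When in Doubt, Summon the Titans" above, the sketch below routes inputs by student confidence: a hypothetical thresholding rule stands in for whatever mechanism that paper actually uses to decide when to summon the large teacher, and the function and parameter names are illustrative only.

```python
import torch
import torch.nn.functional as F

def two_stage_predict(student, teacher, x, confidence_threshold=0.9):
    """Hypothetical two-stage inference: a small distilled student answers
    the 'easy' inputs it is confident about, and only the remaining 'hard'
    inputs are forwarded to the large teacher."""
    with torch.no_grad():
        student_probs = F.softmax(student(x), dim=-1)
        confidence, student_pred = student_probs.max(dim=-1)

        # Easy examples: the cheap student is confident enough.
        easy = confidence >= confidence_threshold
        preds = student_pred.clone()

        # Hard examples: fall back to the expensive teacher.
        if (~easy).any():
            teacher_logits = teacher(x[~easy])
            preds[~easy] = teacher_logits.argmax(dim=-1)
    return preds
```

The amortized inference cost then depends on how often the threshold is exceeded: the more examples the student handles alone, the less the teacher is invoked.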
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.