AUTOKD: Automatic Knowledge Distillation Into A Student Architecture
Family
- URL: http://arxiv.org/abs/2111.03555v1
- Date: Fri, 5 Nov 2021 15:20:37 GMT
- Title: AUTOKD: Automatic Knowledge Distillation Into A Student Architecture
Family
- Authors: Roy Henha Eyono, Fabio Maria Carlucci, Pedro M Esperança, Binxin Ru, Phillip Torr
- Abstract summary: State-of-the-art results in deep learning have been improving steadily, in good part due to the use of larger models.
While Knowledge Distillation (KD) theoretically enables small student models to emulate larger teacher models, in practice selecting a good student architecture requires considerable human expertise.
In this paper, we propose to instead search for a family of student architectures sharing the property of being good at learning from a given teacher.
- Score: 10.51711053229702
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: State-of-the-art results in deep learning have been improving steadily, in
good part due to the use of larger models. However, widespread use is
constrained by device hardware limitations, resulting in a substantial
performance gap between state-of-the-art models and those that can be
effectively deployed on small devices. While Knowledge Distillation (KD)
theoretically enables small student models to emulate larger teacher models, in
practice selecting a good student architecture requires considerable human
expertise. Neural Architecture Search (NAS) appears as a natural solution to
this problem but most approaches can be inefficient, as most of the computation
is spent comparing architectures sampled from the same distribution, with
negligible differences in performance. In this paper, we propose to instead
search for a family of student architectures sharing the property of being good
at learning from a given teacher. Our approach AutoKD, powered by Bayesian
Optimization, explores a flexible graph-based search space, enabling us to
automatically learn the optimal student architecture distribution and KD
parameters, while being 20x more sample efficient compared to existing
state-of-the-art. We evaluate our method on 3 datasets; on large images
specifically, we reach the teacher performance while using 3x less memory and
10x less parameters. Finally, while AutoKD uses the traditional KD loss, it
outperforms more advanced KD variants using hand-designed students.
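For context, the "traditional KD loss" referenced in the abstract is the standard Hinton-style objective: a cross-entropy term on the labels blended with a temperature-softened KL term toward the teacher. Below is a minimal PyTorch sketch; the temperature, weighting factor alpha, and function name are illustrative assumptions rather than values taken from the paper.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Standard (Hinton-style) KD loss: a sketch, not AutoKD's exact setup."""
    # Hard-label cross-entropy against the ground truth.
    ce = F.cross_entropy(student_logits, labels)
    # Temperature-softened distributions; the KL term pulls the student toward the teacher.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    # Blend the two terms; alpha and temperature are tunable KD parameters.
    return alpha * ce + (1.0 - alpha) * kl
```

In AutoKD the loss itself stays fixed; what is searched with Bayesian Optimization over the graph-based space is the student architecture distribution and the KD parameters (quantities like the temperature and weighting above).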
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD matches or exceeds the performance of leading methods across various model architectures and sizes, while reducing training time by up to a factor of four.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Generalizing Teacher Networks for Effective Knowledge Distillation Across Student Architectures [4.960025399247103]
Generic Teacher Network (GTN) is a one-off, KD-aware training scheme that creates a generic teacher capable of effectively transferring knowledge to any student model sampled from a finite pool of architectures.
Our method both improves overall KD effectiveness and amortizes the minimal additional training cost of the generic teacher across students in the pool.
arXiv Detail & Related papers (2024-07-22T20:34:00Z)
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z)
- DistiLLM: Towards Streamlined Distillation for Large Language Models [53.46759297929675]
DistiLLM is a more effective and efficient KD framework for auto-regressive language models.
DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties we unveil and leverage, and (2) an adaptive off-policy approach designed to improve the efficiency of utilizing student-generated outputs.
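As a rough illustration of the skew Kullback-Leibler divergence named above: one common formulation mixes the comparison distribution into the reference before computing the KL term. The sketch below uses that generic definition with the teacher distribution as p; the mixing coefficient alpha is an illustrative assumption, and DistiLLM's exact formulation and scheduling may differ.

```python
import torch
import torch.nn.functional as F

def skew_kl(teacher_logits, student_logits, alpha=0.1, eps=1e-12):
    """Generic skew KL: KL(p || alpha * p + (1 - alpha) * q), with p = teacher, q = student.
    Mixing p into the reference keeps the divergence finite when supports barely overlap."""
    p = F.softmax(teacher_logits, dim=-1)   # teacher distribution
    q = F.softmax(student_logits, dim=-1)   # student distribution
    mix = alpha * p + (1.0 - alpha) * q     # skewed reference distribution
    return (p * (torch.log(p + eps) - torch.log(mix + eps))).sum(dim=-1).mean()
```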
arXiv Detail & Related papers (2024-02-06T11:10:35Z)
- One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation [69.65734716679925]
Knowledge distillation has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme.
Most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family.
We propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures.
arXiv Detail & Related papers (2023-10-30T11:13:02Z)
- Neural Architecture Search for Effective Teacher-Student Knowledge Transfer in Language Models [21.177293243968744]
Knowledge Distillation (KD) into a smaller student model addresses the inefficiency of large language models, allowing for deployment in resource-constrained environments.
We develop multilingual KD-NAS, which uses Neural Architecture Search (NAS) guided by KD to find the optimal student architecture for task distillation from a multilingual teacher.
Using our multi-layer hidden state distillation process, our KD-NAS student model achieves a 7x speedup on CPU inference (2x on GPU) compared to an XLM-RoBERTa Base teacher, while maintaining 90% of its performance.
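The multi-layer hidden state distillation mentioned above is, in its generic form, a regression loss between projected student hidden states and selected teacher layers. The sketch below shows that generic pattern; the layer mapping, linear projections, and equal weighting are assumptions for illustration, not details taken from the KD-NAS paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class HiddenStateDistiller(nn.Module):
    """Generic multi-layer hidden-state distillation (illustrative, not KD-NAS's exact recipe)."""

    def __init__(self, student_dim, teacher_dim, layer_map):
        super().__init__()
        # layer_map: {student_layer_index: teacher_layer_index}, an assumed mapping.
        self.layer_map = layer_map
        self.proj = nn.ModuleDict(
            {str(s): nn.Linear(student_dim, teacher_dim) for s in layer_map}
        )

    def forward(self, student_hidden, teacher_hidden):
        # student_hidden / teacher_hidden: sequences of [batch, seq_len, dim] tensors.
        loss = 0.0
        for s_idx, t_idx in self.layer_map.items():
            projected = self.proj[str(s_idx)](student_hidden[s_idx])
            loss = loss + F.mse_loss(projected, teacher_hidden[t_idx])
        return loss / len(self.layer_map)
```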
arXiv Detail & Related papers (2023-03-16T20:39:44Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- CES-KD: Curriculum-based Expert Selection for Guided Knowledge Distillation [4.182345120164705]
This paper proposes a new technique called Curriculum Expert Selection for Knowledge Distillation (CES-KD).
CES-KD is built upon the hypothesis that a student network should be guided gradually using a stratified teaching curriculum.
Specifically, our method is a gradual TA-based KD technique that selects a single teacher per input image based on a curriculum driven by the difficulty of classifying that image.
arXiv Detail & Related papers (2022-09-15T21:02:57Z)
- Knowledge Distillation with Representative Teacher Keys Based on Attention Mechanism for Image Classification Model Compression [1.503974529275767]
Knowledge distillation (KD) has been recognized as one of the effective methods of model compression for reducing model parameters.
Inspired by the attention mechanism, we propose a novel KD method called representative teacher key (RTK).
Our proposed RTK can effectively improve the classification accuracy of the state-of-the-art attention-based KD method.
arXiv Detail & Related papers (2022-06-26T05:08:50Z)
- AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models [121.22644352431199]
We use Neural Architecture Search (NAS) to automatically distill several compressed students with variable cost from a large model.
Current works train a single SuperLM consisting of millions of subnetworks with weight-sharing.
Experiments on the GLUE benchmark against state-of-the-art KD and NAS methods demonstrate that AutoDistil outperforms leading compression techniques.
arXiv Detail & Related papers (2022-01-29T06:13:04Z)
- Boosting Light-Weight Depth Estimation Via Knowledge Distillation [21.93879961636064]
We propose a lightweight network that can accurately estimate depth maps using minimal computing resources.
We achieve this by designing a compact model architecture that maximally reduces model complexity.
Our method achieves comparable performance to state-of-the-art methods while using only 1% of their parameters.
arXiv Detail & Related papers (2021-05-13T08:42:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.