Neural Architecture Search for Effective Teacher-Student Knowledge
Transfer in Language Models
- URL: http://arxiv.org/abs/2303.09639v2
- Date: Fri, 13 Oct 2023 21:34:39 GMT
- Title: Neural Architecture Search for Effective Teacher-Student Knowledge
Transfer in Language Models
- Authors: Aashka Trivedi, Takuma Udagawa, Michele Merler, Rameswar Panda, Yousef
El-Kurdi, Bishwaranjan Bhattacharjee
- Abstract summary: Knowledge Distillation (KD) into a smaller student model addresses the inefficiency of large pretrained language models, allowing for deployment in resource-constrained environments.
We develop multilingual KD-NAS, the use of Neural Architecture Search (NAS) guided by KD to find the optimal student architecture for task-agnostic distillation from a multilingual teacher.
Using our multi-layer hidden state distillation process, our KD-NAS student model achieves a 7x speedup on CPU inference (2x on GPU) compared to an XLM-RoBERTa Base teacher, while maintaining 90% of the teacher's performance.
- Score: 21.177293243968744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large pretrained language models have achieved state-of-the-art results on a
variety of downstream tasks. Knowledge Distillation (KD) into a smaller student
model addresses their inefficiency, allowing for deployment in
resource-constrained environments. However, KD can be ineffective when the
student is manually selected from a set of existing options, since it can be a
sub-optimal choice within the space of all possible student architectures. We
develop multilingual KD-NAS, the use of Neural Architecture Search (NAS) guided
by KD to find the optimal student architecture for task-agnostic distillation
from a multilingual teacher. In each episode of the search process, a NAS
controller predicts a reward based on the distillation loss and latency of
inference. The top candidate architectures are then distilled from the teacher
on a small proxy set. Finally, the architecture(s) with the highest reward are
selected and distilled on the full training corpus. KD-NAS can automatically
trade off efficiency and effectiveness, and recommends architectures suitable
to various latency budgets. Using our multi-layer hidden state distillation
process, our KD-NAS student model achieves a 7x speedup on CPU inference (2x on
GPU) compared to an XLM-RoBERTa Base teacher, while maintaining 90% of the teacher's performance,
and has been deployed in 3 software offerings requiring large throughput, low
latency and deployment on CPU.
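The abstract describes each search episode only at a high level (the controller predicts a reward from distillation loss and inference latency, the top candidates are distilled on a small proxy set, and the best architecture is kept). The sketch below illustrates one such episode; the toy search space, the linear reward formula, and the `controller.predict_reward` interface are assumptions made for illustration, not details given in the paper.

```python
import random

# Toy student search space; the real KD-NAS search space is not specified in the abstract.
SEARCH_SPACE = {
    "num_layers": [4, 6, 8],
    "hidden_size": [312, 384, 512],
    "num_attention_heads": [6, 8, 12],
}

def sample_candidates(k):
    """Sample k candidate student architectures uniformly from the search space."""
    return [{name: random.choice(opts) for name, opts in SEARCH_SPACE.items()}
            for _ in range(k)]

def reward(distill_loss, latency, alpha=1.0, beta=0.1):
    """Assumed reward: lower distillation loss and lower inference latency are better."""
    return -alpha * distill_loss - beta * latency

def kd_nas_episode(controller, proxy_distill, measure_latency,
                   num_samples=50, top_k=3):
    # 1. The NAS controller predicts a reward for each sampled candidate
    #    (predict_reward is a hypothetical interface, not the paper's API).
    candidates = sample_candidates(num_samples)
    candidates.sort(key=controller.predict_reward, reverse=True)
    # 2. The top candidates are distilled from the teacher on a small proxy set.
    scored = []
    for arch in candidates[:top_k]:
        loss = proxy_distill(arch)    # multi-layer hidden-state distillation loss
        lat = measure_latency(arch)   # measured or estimated inference latency
        scored.append((reward(loss, lat), arch))
    # 3. The highest-reward architecture is kept; after the search it is
    #    distilled on the full training corpus.
    return max(scored, key=lambda pair: pair[0])
```

In a full search, the observed rewards would presumably be fed back to refine the controller across episodes; the abstract only states that the highest-reward architecture is ultimately distilled on the full training corpus.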
Related papers
- SalNAS: Efficient Saliency-prediction Neural Architecture Search with self-knowledge distillation [7.625269122161064]
Recent advancements in deep convolutional neural networks have significantly improved the performance of saliency prediction.
We propose a new Neural Architecture Search framework for saliency prediction with two contributions.
By utilizing Self-KD, SalNAS outperforms other state-of-the-art saliency prediction models in most evaluation rubrics.
arXiv Detail & Related papers (2024-07-29T14:48:34Z) - DNA Family: Boosting Weight-Sharing NAS with Block-Wise Supervisions [121.05720140641189]
We develop a family of models with the distilling neural architecture (DNA) techniques.
Our proposed DNA models can rate all architecture candidates, as opposed to previous works that can only access a sub-search space using heuristic algorithms.
Our models achieve state-of-the-art top-1 accuracy of 78.9% and 83.6% on ImageNet for a mobile convolutional network and a small vision transformer, respectively.
arXiv Detail & Related papers (2024-03-02T22:16:47Z) - Meta-prediction Model for Distillation-Aware NAS on Unseen Datasets [55.2118691522524]
Distillation-aware Neural Architecture Search (DaNAS) aims to search for an optimal student architecture.
We propose a distillation-aware meta accuracy prediction model, DaSS (Distillation-aware Student Search), which can predict a given architecture's final performance on a dataset.
arXiv Detail & Related papers (2023-05-26T14:00:35Z) - AutoDistil: Few-shot Task-agnostic Neural Architecture Search for
Distilling Large Language Models [121.22644352431199]
We use Neural Architecture Search (NAS) to automatically distill several compressed students with variable cost from a large model.
Current works train a single SuperLM consisting of millions of subnetworks with weight-sharing.
Experiments on GLUE benchmark against state-of-the-art KD and NAS methods demonstrate AutoDistil to outperform leading compression techniques.
arXiv Detail & Related papers (2022-01-29T06:13:04Z) - AUTOKD: Automatic Knowledge Distillation Into A Student Architecture
Family [10.51711053229702]
State-of-the-art results in deep learning have been improving steadily, in good part due to the use of larger models.
While Knowledge Distillation (KD) theoretically enables small student models to emulate larger teacher models, in practice selecting a good student architecture requires considerable human expertise.
In this paper, we propose to instead search for a family of student architectures sharing the property of being good at learning from a given teacher.
arXiv Detail & Related papers (2021-11-05T15:20:37Z) - How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in Knowledge Distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z) - Joint-DetNAS: Upgrade Your Detector with NAS, Pruning and Dynamic
Distillation [49.421099172544196]
We propose Joint-DetNAS, a unified NAS framework for object detection.
Joint-DetNAS integrates 3 key components: Neural Architecture Search, pruning, and Knowledge Distillation.
Our algorithm directly outputs the derived student detector with high performance without additional training.
arXiv Detail & Related papers (2021-05-27T07:25:43Z) - PONAS: Progressive One-shot Neural Architecture Search for Very
Efficient Deployment [9.442139459221783]
We propose Progressive One-shot Neural Architecture Search (PONAS) that combines advantages of progressive NAS and one-shot methods.
PONAS is able to find the architecture of a specialized network in around 10 seconds.
In ImageNet classification, 75.2% top-1 accuracy can be obtained, which is comparable with the state of the art.
arXiv Detail & Related papers (2020-03-11T05:00:31Z) - DDPNAS: Efficient Neural Architecture Search via Dynamic Distribution
Pruning [135.27931587381596]
We propose an efficient and unified NAS framework termed DDPNAS via dynamic distribution pruning.
In particular, we first sample architectures from a joint categorical distribution. Then the search space is dynamically pruned and its distribution is updated every few epochs.
With the proposed efficient network generation method, we directly obtain the optimal neural architectures on given constraints.
arXiv Detail & Related papers (2019-05-28T06:35:52Z)
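To make the dynamic distribution pruning idea in the DDPNAS entry above more concrete, here is a rough sketch that samples architectures from a joint categorical distribution and drops each decision's worst-scoring option every few epochs. The toy search space, the stand-in evaluation function, the pruning schedule, and keeping the distribution uniform over the surviving options are all simplifying assumptions rather than the paper's actual procedure.

```python
import random

# Toy search space: each decision is a categorical choice (illustrative only).
search_space = {
    "kernel_size": [3, 5, 7],
    "width": [32, 64, 128],
    "depth": [2, 3, 4],
}

def sample_architecture():
    """Sample one architecture from the joint categorical distribution
    (kept uniform over the surviving options for simplicity)."""
    return {name: random.choice(options) for name, options in search_space.items()}

def prune_search_space(history):
    """Drop the worst-scoring surviving option of each decision.
    history is a list of (architecture, score) pairs; higher score is better."""
    for name, options in search_space.items():
        if len(options) <= 1:
            continue
        def mean_score(option):
            scores = [s for arch, s in history if arch[name] == option]
            # Options that were never sampled are kept rather than pruned.
            return sum(scores) / len(scores) if scores else float("inf")
        options.remove(min(options, key=mean_score))

def dummy_evaluate(arch):
    """Stand-in for training and validating a sampled architecture."""
    return arch["width"] / 128 - abs(arch["kernel_size"] - 5) * 0.1 + arch["depth"] * 0.05

history = []
for epoch in range(12):
    arch = sample_architecture()
    history.append((arch, dummy_evaluate(arch)))
    if (epoch + 1) % 4 == 0:   # prune the search space every few epochs
        prune_search_space(history)
```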