AutoDistil: Few-shot Task-agnostic Neural Architecture Search for
Distilling Large Language Models
- URL: http://arxiv.org/abs/2201.12507v1
- Date: Sat, 29 Jan 2022 06:13:04 GMT
- Title: AutoDistil: Few-shot Task-agnostic Neural Architecture Search for
Distilling Large Language Models
- Authors: Dongkuan Xu, Subhabrata Mukherjee, Xiaodong Liu, Debadeepta Dey,
Wenhui Wang, Xiang Zhang, Ahmed Hassan Awadallah, Jianfeng Gao
- Abstract summary: We use Neural Architecture Search (NAS) to automatically distill several compressed students with variable cost from a large model.
Current works train a single SuperLM consisting of millions of subnetworks with weight-sharing.
Experiments on GLUE benchmark against state-of-the-art KD and NAS methods demonstrate that AutoDistil outperforms leading compression techniques.
- Score: 121.22644352431199
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Knowledge distillation (KD) methods compress large models into smaller
students with manually-designed student architectures given pre-specified
computational cost. This requires several trials to find a viable student, and
further repeating the process for each student or computational budget change.
We use Neural Architecture Search (NAS) to automatically distill several
compressed students with variable cost from a large model. Current works train
a single SuperLM consisting of millions of subnetworks with weight-sharing,
resulting in interference between subnetworks of different sizes. Our framework
AutoDistil addresses the above challenges with the following steps: (a)
Incorporates inductive bias and heuristics to partition Transformer search
space into K compact sub-spaces (K=3 for typical student sizes of base, small
and tiny); (b) Trains one SuperLM for each sub-space using task-agnostic
objective (e.g., self-attention distillation) with weight-sharing of students;
(c) Lightweight search for the optimal student without re-training. Fully
task-agnostic training and search allow students to be reused for fine-tuning
on any downstream task. Experiments on GLUE benchmark against state-of-the-art
KD and NAS methods demonstrate that AutoDistil outperforms leading compression
techniques with up to 2.7x reduction in computational cost and negligible loss
in task performance.
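To make the three-step recipe concrete, here is a minimal Python sketch (not the authors' implementation): the sub-space grid, the weight-shared student sampling, and the KL-based self-attention distillation objective below are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

# Illustrative sub-space partition (hypothetical ranges, not the paper's exact grid).
SUB_SPACES = {
    "base":  {"layers": [10, 11, 12], "hidden": [640, 768], "heads": [10, 12]},
    "small": {"layers": [6, 7, 8],    "hidden": [384, 512], "heads": [6, 8]},
    "tiny":  {"layers": [3, 4, 5],    "hidden": [192, 256], "heads": [3, 4]},
}

def sample_student(sub_space_name: str) -> dict:
    """Sample one weight-shared student configuration from a compact sub-space."""
    space = SUB_SPACES[sub_space_name]
    return {dim: random.choice(choices) for dim, choices in space.items()}

def self_attention_distillation_loss(teacher_attn: torch.Tensor,
                                     student_attn: torch.Tensor) -> torch.Tensor:
    """KL divergence between teacher and student self-attention distributions
    (one simple task-agnostic objective; the paper's exact loss may differ)."""
    t = teacher_attn.clamp_min(1e-9)
    s = student_attn.clamp_min(1e-9)
    return F.kl_div(s.log(), t, reduction="batchmean")

if __name__ == "__main__":
    config = sample_student("small")
    print("sampled student:", config)
    # Toy attention maps of shape (batch, heads, seq, seq); rows sum to 1.
    teacher = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)
    student = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)
    print("distillation loss:", self_attention_distillation_loss(teacher, student).item())
```

In this picture, step (c) amounts to evaluating sampled configurations against the shared SuperLM weights and returning the best one under the target budget, with no retraining.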
Related papers
- DNA Family: Boosting Weight-Sharing NAS with Block-Wise Supervisions [121.05720140641189]
We develop a family of models with the distilling neural architecture (DNA) techniques.
Our proposed DNA models can rate all architecture candidates, as opposed to previous works that can only access a sub-search space using algorithms.
Our models achieve state-of-the-art top-1 accuracy of 78.9% and 83.6% on ImageNet for a mobile convolutional network and a small vision transformer, respectively.
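The block-wise supervision idea can be illustrated with a short, hedged sketch: each student block is trained to mimic the corresponding teacher block's output given the teacher's input feature, so blocks can be trained and rated independently. The block definitions and MSE objective below are assumptions, not the DNA paper's exact setup.

```python
import torch
import torch.nn as nn

def blockwise_distill_step(teacher_blocks, student_blocks, x, optimizer):
    """One training step of block-wise distillation: every student block
    regresses the matching teacher block's output from the teacher's input."""
    losses = []
    feat = x
    for t_block, s_block in zip(teacher_blocks, student_blocks):
        with torch.no_grad():
            target = t_block(feat)        # teacher output for this block
        pred = s_block(feat.detach())     # student sees the teacher's input feature
        losses.append(nn.functional.mse_loss(pred, target))
        feat = target                     # next block starts from the teacher feature
    loss = torch.stack(losses).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    teacher_blocks = nn.ModuleList([nn.Linear(32, 32) for _ in range(3)]).eval()
    student_blocks = nn.ModuleList([nn.Linear(32, 32) for _ in range(3)])
    opt = torch.optim.SGD(student_blocks.parameters(), lr=0.1)
    print(blockwise_distill_step(teacher_blocks, student_blocks, torch.randn(8, 32), opt))
```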
arXiv Detail & Related papers (2024-03-02T22:16:47Z) - RdimKD: Generic Distillation Paradigm by Dimensionality Reduction [16.977144350795488]
Knowledge Distillation (KD) emerges as one of the most promising compression technologies to run advanced deep neural networks on resource-limited devices.
In this work, we proposed an abstract and general paradigm for the KD task, referred to as DIMensionality Reduction KD (RdimKD)
RdimKD solely relies on dimensionality reduction, with a very minor modification to naive L2 loss.
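As a rough illustration of the described recipe, the sketch below reduces the teacher feature dimension and applies an L2 loss; the learned linear projection and the feature normalization are assumptions, not necessarily the paper's exact choices.

```python
import torch
import torch.nn as nn

class RdimKDLoss(nn.Module):
    """Illustrative sketch: reduce the teacher feature dimension to match the
    student, then apply an L2 loss on normalized features."""
    def __init__(self, teacher_dim: int, student_dim: int):
        super().__init__()
        # Assumed: a learned linear projection as the dimensionality reduction.
        self.reduce = nn.Linear(teacher_dim, student_dim, bias=False)

    def forward(self, teacher_feat: torch.Tensor, student_feat: torch.Tensor) -> torch.Tensor:
        target = nn.functional.normalize(self.reduce(teacher_feat.detach()), dim=-1)
        pred = nn.functional.normalize(student_feat, dim=-1)
        return nn.functional.mse_loss(pred, target)

if __name__ == "__main__":
    loss_fn = RdimKDLoss(teacher_dim=1024, student_dim=256)
    print(loss_fn(torch.randn(16, 1024), torch.randn(16, 256)).item())
```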
arXiv Detail & Related papers (2023-12-14T07:34:08Z) - Neural Architecture Search for Effective Teacher-Student Knowledge
Transfer in Language Models [21.177293243968744]
Knowledge Distillation (KD) of large pretrained language models into a smaller student model addresses their inefficiency, allowing for deployment in resource-constrained environments.
We develop multilingual KD-NAS, the use of Neural Architecture Search (NAS) guided by KD to find the optimal student architecture for task distillation from a multilingual teacher.
Using our multi-layer hidden state distillation process, our KD-NAS student model achieves a 7x speedup on CPU inference (2x on GPU) compared to an XLM-RoBERTa Base teacher, while maintaining 90% performance.
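A hedged sketch of multi-layer hidden-state distillation: selected student layers are matched to teacher layers through a projection and an MSE loss. The layer mapping and the projection below are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

def multilayer_hidden_state_loss(teacher_hidden, student_hidden, layer_map, proj):
    """Average MSE between projected student hidden states and mapped teacher
    hidden states. `layer_map` sends student layer index -> teacher layer index."""
    loss = 0.0
    for s_idx, t_idx in layer_map.items():
        loss = loss + nn.functional.mse_loss(proj(student_hidden[s_idx]),
                                             teacher_hidden[t_idx].detach())
    return loss / len(layer_map)

if __name__ == "__main__":
    # Toy hidden states: teacher 12 layers x 768 dims, student 4 layers x 384 dims.
    teacher_hidden = [torch.randn(8, 16, 768) for _ in range(12)]
    student_hidden = [torch.randn(8, 16, 384) for _ in range(4)]
    proj = nn.Linear(384, 768)                 # align student width to teacher width
    layer_map = {0: 2, 1: 5, 2: 8, 3: 11}      # assumed uniform student->teacher mapping
    print(multilayer_hidden_state_loss(teacher_hidden, student_hidden, layer_map, proj).item())
```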
arXiv Detail & Related papers (2023-03-16T20:39:44Z) - DiSparse: Disentangled Sparsification for Multitask Model Compression [92.84435347164435]
DiSparse is a simple, effective, and first-of-its-kind multitask pruning and sparse training scheme.
Our experimental results demonstrate superior performance on various configurations and settings.
arXiv Detail & Related papers (2022-06-09T17:57:46Z) - Efficient Architecture Search for Diverse Tasks [29.83517145790238]
We study neural architecture search (NAS) for efficiently solving diverse problems.
We introduce DASH, a differentiable NAS algorithm that computes the mixture-of-operations using the Fourier diagonalization of convolution.
We evaluate DASH on NAS-Bench-360, a suite of ten tasks designed for NAS benchmarking in diverse domains.
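The mixture-of-operations trick can be sketched as follows: because convolution is linear in its kernel, kernels of different sizes can be zero-padded to a common size, mixed by architecture weights, and applied as a single convolution; DASH additionally evaluates this in the Fourier domain, which the sketch omits. The `MixedConv1d` layer below is an illustrative assumption, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedConv1d(nn.Module):
    """Differentiable mixture over convolution kernel sizes, computed as one conv."""
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.kernel_sizes = kernel_sizes
        self.max_k = max(kernel_sizes)
        self.kernels = nn.ParameterList(
            [nn.Parameter(torch.randn(channels, channels, k) * 0.02) for k in kernel_sizes])
        self.alpha = nn.Parameter(torch.zeros(len(kernel_sizes)))  # architecture weights

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        mixed = 0.0
        for weight, kernel, k in zip(w, self.kernels, self.kernel_sizes):
            pad = (self.max_k - k) // 2
            mixed = mixed + weight * F.pad(kernel, (pad, pad))  # pad kernel, not input
        return F.conv1d(x, mixed, padding=self.max_k // 2)

if __name__ == "__main__":
    layer = MixedConv1d(channels=8)
    print(layer(torch.randn(2, 8, 32)).shape)   # torch.Size([2, 8, 32])
```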
arXiv Detail & Related papers (2022-04-15T17:21:27Z) - Elastic Architecture Search for Diverse Tasks with Different Resources [87.23061200971912]
We study a new challenging problem of efficient deployment for diverse tasks with different resources, where the resource constraint and task of interest corresponding to a group of classes are dynamically specified at testing time.
Previous NAS approaches seek to design architectures for all classes simultaneously, which may not be optimal for some individual tasks.
We present a novel and general framework, called Elastic Architecture Search (EAS), permitting instant specializations at runtime for diverse tasks with various resource constraints.
arXiv Detail & Related papers (2021-08-03T00:54:27Z) - NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural
Architecture Search [100.71365025972258]
We propose NAS-BERT, an efficient method for BERT compression.
NAS-BERT trains a big supernet on a search space and outputs multiple compressed models with adaptive sizes and latency.
Experiments on GLUE and SQuAD benchmark datasets demonstrate that NAS-BERT can find lightweight models with better accuracy than previous approaches.
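A minimal sketch of picking a compressed sub-architecture under a size budget is shown below; the search space and the parameter-count proxy are assumptions for illustration, not NAS-BERT's actual search space or latency model.

```python
import random

# Assumed toy search space over supernet sub-architectures.
SEARCH_SPACE = {"layers": [4, 6, 8, 10, 12], "hidden": [128, 256, 384, 512]}

def param_count(cfg):
    # Rough Transformer proxy: ~12 * hidden^2 parameters per layer (attention + FFN).
    return 12 * cfg["hidden"] ** 2 * cfg["layers"]

def sample_under_budget(max_params, trials=1000):
    """Randomly sample configurations and keep the largest one within the budget."""
    best = None
    for _ in range(trials):
        cfg = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
        size = param_count(cfg)
        if size <= max_params and (best is None or size > best[1]):
            best = (cfg, size)
    return best

if __name__ == "__main__":
    cfg, size = sample_under_budget(max_params=15_000_000)
    print(cfg, f"{size / 1e6:.1f}M params")
```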
arXiv Detail & Related papers (2021-05-30T07:20:27Z) - Joint-DetNAS: Upgrade Your Detector with NAS, Pruning and Dynamic
Distillation [49.421099172544196]
We propose Joint-DetNAS, a unified NAS framework for object detection.
Joint-DetNAS integrates 3 key components: Neural Architecture Search, pruning, and Knowledge Distillation.
Our algorithm directly outputs the derived student detector with high performance without additional training.
arXiv Detail & Related papers (2021-05-27T07:25:43Z) - Teachers Do More Than Teach: Compressing Image-to-Image Models [35.40756344110666]
Generative Adversarial Networks (GANs) have achieved huge success in generating high-fidelity images.
GANs suffer from low efficiency due to tremendous computational cost and bulky memory usage.
Recent efforts on compressing GANs show noticeable progress in obtaining smaller generators.
arXiv Detail & Related papers (2021-03-05T04:29:34Z)