AutoDistill: an End-to-End Framework to Explore and Distill
Hardware-Efficient Language Models
- URL: http://arxiv.org/abs/2201.08539v1
- Date: Fri, 21 Jan 2022 04:32:19 GMT
- Title: AutoDistill: an End-to-End Framework to Explore and Distill
Hardware-Efficient Language Models
- Authors: Xiaofan Zhang, Zongwei Zhou, Deming Chen, Yu Emma Wang
- Abstract summary: We propose AutoDistill, an end-to-end model distillation framework for building hardware-efficient NLP pre-trained models.
Experiments on TPUv4i identify seven model architectures with better pre-trained accuracy (up to 3.2% higher) and lower inference latency (up to 1.44x faster) than MobileBERT.
By running downstream NLP tasks in the GLUE benchmark, the model distilled for pre-training by AutoDistill with 28.5M parameters achieves an 81.69 average score.
- Score: 20.04008357406888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large pre-trained models have significantly improved the
performance of various Natural Language Processing (NLP) tasks, but they are
expensive to serve due to long serving latency and large memory usage. To
compress these models, knowledge distillation has attracted an increasing
amount of interest as one of the most effective methods for model compression.
However, existing distillation methods have not yet addressed the unique
challenges of model serving in datacenters, such as handling fast evolving
models, considering serving performance, and optimizing for multiple
objectives. To solve these problems, we propose AutoDistill, an end-to-end
model distillation framework integrating model architecture exploration and
multi-objective optimization for building hardware-efficient NLP pre-trained
models. We use Bayesian Optimization to conduct multi-objective Neural
Architecture Search for selecting student model architectures. The proposed
search comprehensively considers both prediction accuracy and serving latency
on target hardware. Experiments on TPUv4i identify seven model
architectures with better pre-trained accuracy (up to 3.2% higher) and lower
inference latency (up to 1.44x faster) than MobileBERT. By running downstream
NLP tasks in the GLUE benchmark, the model distilled for pre-training by
AutoDistill with 28.5M parameters achieves an 81.69 average score, which is
higher than BERT_BASE, DistilBERT, TinyBERT, NAS-BERT, and MobileBERT. The
most compact model found by AutoDistill contains only 20.6M parameters but
still outperforms BERT_BASE (109M), DistilBERT (67M), TinyBERT (67M), and
MobileBERT (25.3M) in terms of average GLUE score. Evaluated on SQuAD, a
model found by AutoDistill achieves an 88.4% F1 score with 22.8M parameters,
which reduces parameters by more than 62% while maintaining higher accuracy
than DistilBERT, TinyBERT, and NAS-BERT.
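The core of the framework described above is a search that scores each candidate student architecture on both prediction accuracy and serving latency on the target hardware, then keeps the candidates that trade off best between the two. Below is a minimal, hypothetical sketch of that selection logic: the search space, the accuracy proxy, and the latency model are illustrative placeholders (the paper uses Bayesian Optimization with real distillation/pre-training signals and TPUv4i latency measurements), so this is a sketch of the idea rather than the authors' implementation.

```python
# Hypothetical sketch of multi-objective student-architecture selection.
# proxy_accuracy() and measured_latency_ms() are stand-ins for the real
# pre-training proxy and on-device (e.g., TPUv4i) latency measurement.
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class StudentConfig:
    num_layers: int
    hidden_size: int
    num_heads: int
    ffn_size: int

def sample_config(rng: random.Random) -> StudentConfig:
    """Draw a candidate student architecture from a toy search space."""
    return StudentConfig(
        num_layers=rng.choice([4, 6, 8, 12]),
        hidden_size=rng.choice([256, 384, 512]),
        num_heads=rng.choice([4, 8]),
        ffn_size=rng.choice([1024, 1536, 2048]),
    )

def proxy_accuracy(cfg: StudentConfig) -> float:
    """Placeholder for a distillation/pre-training accuracy proxy."""
    capacity = cfg.num_layers * cfg.hidden_size
    return 0.70 + 0.12 * min(capacity / (12 * 512), 1.0)

def measured_latency_ms(cfg: StudentConfig) -> float:
    """Placeholder for latency measured on the target hardware."""
    return 0.002 * cfg.num_layers * (cfg.hidden_size + cfg.ffn_size)

def pareto_front(results):
    """Keep candidates not dominated in (higher accuracy, lower latency)."""
    front = []
    for cfg, acc, lat in results:
        dominated = any(a >= acc and l <= lat and (a > acc or l < lat)
                        for _, a, l in results)
        if not dominated:
            front.append((cfg, acc, lat))
    return front

rng = random.Random(0)
trials = [sample_config(rng) for _ in range(32)]
results = [(c, proxy_accuracy(c), measured_latency_ms(c)) for c in trials]
for cfg, acc, lat in pareto_front(results):
    print(f"{cfg}: acc~{acc:.3f}, latency~{lat:.2f} ms")
```

In the actual framework, the evaluated points would feed back into a Bayesian Optimization surrogate that proposes the next student architectures to distill and benchmark, rather than being drawn at random as above.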
Related papers
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4x compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
- Achieving Pareto Optimality using Efficient Parameter Reduction for DNNs in Resource-Constrained Edge Environment [1.9055921262476347]
This paper proposes an optimization of an existing Deep Neural Network (DNN) that improves its hardware utilization and facilitates on-device training for resource-constrained edge environments.
We implement efficient parameter reduction strategies on Xception that shrink the model size without sacrificing accuracy, thus decreasing memory utilization during training.
arXiv Detail & Related papers (2024-03-14T19:40:58Z)
- Quantized Distillation: Optimizing Driver Activity Recognition Models for Resource-Constrained Environments [34.80538284957094]
This paper introduces a lightweight framework for resource-efficient driver activity recognition.
The framework enhances 3D MobileNet, a neural architecture optimized for speed in video classification.
It achieves a threefold reduction in model size and a 1.4-fold improvement in inference time.
arXiv Detail & Related papers (2023-11-10T10:07:07Z)
- Pruning Large Language Models via Accuracy Predictor [0.0]
Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks.
We propose a novel pruning approach: first, a training set of architecture-accuracy pairs is established, and then a non-neural model is trained as an accuracy predictor.
arXiv Detail & Related papers (2023-09-18T06:38:24Z)
- Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems [63.713297451300086]
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system.
arXiv Detail & Related papers (2022-06-15T20:44:23Z)
- FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid? [7.813154720635396]
We propose a suite of heterogeneous and flexible models, namely FlexiBERT, that have varied encoder layers with a diverse set of possible operations.
We also propose a novel NAS policy, called BOSHNAS, that leverages this new scheme, Bayesian modeling, and second-order optimization.
A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models.
arXiv Detail & Related papers (2022-05-23T22:44:34Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- TangoBERT: Reducing Inference Cost by using Cascaded Architecture [9.496399437260678]
We present TangoBERT, a cascaded model architecture in which instances are first processed by an efficient but less accurate first-tier model.
The decision of whether to apply the second-tier model is based on a confidence score produced by the first-tier model; an illustrative sketch of this confidence-threshold routing appears after this list.
We report TangoBERT inference CPU speedup on four text classification GLUE tasks and on one reading comprehension task.
arXiv Detail & Related papers (2022-04-13T09:45:08Z)
- EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation [82.3956677850676]
Pre-trained language models have shown remarkable results on various NLP tasks.
Due to their bulky size and slow inference speed, it is hard to deploy them on edge devices.
In this paper, we have a critical insight that improving the feed-forward network (FFN) in BERT yields a higher gain than improving the multi-head attention (MHA).
arXiv Detail & Related papers (2021-09-15T11:25:39Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
In order to achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the queries of the decoder from the inputs, enabling the model to achieve as good accuracy as the ones with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- DeBERTa: Decoding-enhanced BERT with Disentangled Attention [119.77305080520718]
We propose a new model architecture, DeBERTa, that improves the BERT and RoBERTa models using two novel techniques.
We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks.
arXiv Detail & Related papers (2020-06-05T19:54:34Z)
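As referenced in the TangoBERT entry above, cascaded inference routes each input through a cheap first-tier model and only escalates low-confidence cases to a larger model. The sketch below illustrates that generic confidence-threshold routing; the stub models and the 0.9 threshold are invented for illustration and are not TangoBERT's actual components or calibration.

```python
# Minimal sketch of confidence-based cascaded inference (generic, not TangoBERT's code).
from typing import Callable, List, Tuple

def cascade_predict(
    texts: List[str],
    small_model: Callable[[str], Tuple[int, float]],  # returns (label, confidence)
    large_model: Callable[[str], int],
    threshold: float = 0.9,
) -> List[int]:
    """Route each input through the cheap tier; escalate low-confidence cases."""
    preds = []
    for text in texts:
        label, confidence = small_model(text)   # fast, less accurate first tier
        if confidence < threshold:
            label = large_model(text)            # accurate, more expensive second tier
        preds.append(label)
    return preds

# Toy usage with stand-in models: short inputs are "easy", longer ones are escalated.
small = lambda t: (0, 0.95) if len(t) < 20 else (1, 0.50)
large = lambda t: 1
print(cascade_predict(["short review", "a much longer and more ambiguous review"], small, large))
```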