QuaLA-MiniLM: a Quantized Length Adaptive MiniLM
- URL: http://arxiv.org/abs/2210.17114v3
- Date: Wed, 10 May 2023 12:57:33 GMT
- Title: QuaLA-MiniLM: a Quantized Length Adaptive MiniLM
- Authors: Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen
- Abstract summary: Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized.
A knowledge distillation approach addresses the computational efficiency by self-distilling BERT into a smaller transformer representation having fewer layers and smaller internal embedding.
Dynamic-TinyBERT tackles both limitations by partially implementing the Length Adaptive Transformer (LAT) technique onto TinyBERT, achieving x3 speedup over BERT-base with minimal accuracy loss.
We use MiniLM distillation jointly with the LAT method, and we further enhance the efficiency by applying low-bit quantization.
- Score: 5.36703735486629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Limited computational budgets often prevent transformers from being used in
production and from having their high accuracy utilized. A knowledge
distillation approach addresses the computational efficiency by self-distilling
BERT into a smaller transformer representation having fewer layers and smaller
internal embedding. However, the performance of these models drops as we reduce
the number of layers, notably in advanced NLP tasks such as span question
answering. In addition, a separate model must be trained for each inference
scenario with its distinct computational budget. Dynamic-TinyBERT tackles both
limitations by partially implementing the Length Adaptive Transformer (LAT)
technique onto TinyBERT, achieving x3 speedup over BERT-base with minimal
accuracy loss. In this work, we expand the Dynamic-TinyBERT approach to
generate a much more highly efficient model. We use MiniLM distillation jointly
with the LAT method, and we further enhance the efficiency by applying low-bit
quantization. Our quantized length-adaptive MiniLM model (QuaLA-MiniLM) is
trained only once, dynamically fits any inference scenario, and achieves an
accuracy-efficiency trade-off superior to any other efficient approaches per
any computational budget on the SQuAD1.1 dataset (up to x8.8 speedup with <1%
accuracy loss). The code to reproduce this work is publicly available on
Github.
Related papers
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM
Inference [68.59839755875252]
HiRE comprises of two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - Parameter and Computation Efficient Transfer Learning for
Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z) - FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only
Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of
Language Model [92.55145016562867]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - MiniALBERT: Model Distillation via Parameter-Efficient Recursive
Transformers [12.432191400869002]
MiniALBERT is a technique for converting the knowledge of fully parameterised LMs (such as BERT) into a compact recursive student.
We test our proposed models on a number of general and biomedical NLP tasks to demonstrate their viability and compare them with the state-of-the-art and other existing compact models.
arXiv Detail & Related papers (2022-10-12T17:23:21Z) - Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than
In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
parameter-efficient fine-tuning offers an alternative paradigm where a small set of parameters are trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z) - TangoBERT: Reducing Inference Cost by using Cascaded Architecture [9.496399437260678]
We present TangoBERT, a cascaded model architecture in which instances are first processed by an efficient but less accurate first tier model.
The decision of whether to apply the second tier model is based on a confidence score produced by the first tier model.
We report TangoBERT inference CPU speedup on four text classification GLUE tasks and on one reading comprehension task.
arXiv Detail & Related papers (2022-04-13T09:45:08Z) - Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic
Sequence Length [2.8770761243361593]
TinyBERT addresses the computational efficiency by self-distilling BERT into a smaller transformer representation.
Dynamic-TinyBERT is trained only once, performing on-par with BERT and achieving an accuracy-speedup trade-off superior to any other efficient approaches.
arXiv Detail & Related papers (2021-11-18T11:58:19Z) - DACT-BERT: Differentiable Adaptive Computation Time for an Efficient
BERT Inference [3.375478015832455]
We propose DACT-BERT, a differentiable adaptive computation time strategy for BERT-like models.
DACT-BERT adds an adaptive computational mechanism to BERT's regular processing pipeline, which controls the number of Transformer blocks that need to be executed at inference time.
Our experiments demonstrate that our approach, when compared to the baselines, excels on a reduced computational regime and is competitive in other less restrictive ones.
arXiv Detail & Related papers (2021-09-24T04:45:55Z) - TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference [54.791572981834435]
Existing pre-trained language models (PLMs) are often computationally expensive in inference.
We propose a dynamic token reduction approach to accelerate PLMs' inference, named TR-BERT.
TR-BERT formulates the token reduction process as a multi-step token selection problem and automatically learns the selection strategy via reinforcement learning.
arXiv Detail & Related papers (2021-05-25T02:28:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.