QuaLA-MiniLM: a Quantized Length Adaptive MiniLM
- URL: http://arxiv.org/abs/2210.17114v3
- Date: Wed, 10 May 2023 12:57:33 GMT
- Title: QuaLA-MiniLM: a Quantized Length Adaptive MiniLM
- Authors: Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen
- Abstract summary: Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized.
A knowledge distillation approach addresses computational efficiency by self-distilling BERT into a smaller transformer representation with fewer layers and a smaller internal embedding dimension.
Dynamic-TinyBERT tackles both limitations by partially applying the Length Adaptive Transformer (LAT) technique to TinyBERT, achieving a 3x speedup over BERT-base with minimal accuracy loss.
We use MiniLM distillation jointly with the LAT method, and we further enhance the efficiency by applying low-bit quantization.
- Score: 5.36703735486629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Limited computational budgets often prevent transformers from being used in
production and from having their high accuracy utilized. A knowledge
distillation approach addresses computational efficiency by self-distilling
BERT into a smaller transformer representation with fewer layers and a smaller
internal embedding dimension. However, the performance of these models drops
as we reduce
the number of layers, notably in advanced NLP tasks such as span question
answering. In addition, a separate model must be trained for each inference
scenario with its distinct computational budget. Dynamic-TinyBERT tackles both
limitations by partially applying the Length Adaptive Transformer (LAT)
technique to TinyBERT, achieving a 3x speedup over BERT-base with minimal
accuracy loss. In this work, we extend the Dynamic-TinyBERT approach to
produce a substantially more efficient model. We use MiniLM distillation jointly
with the LAT method, and we further enhance the efficiency by applying low-bit
quantization. Our quantized length-adaptive MiniLM model (QuaLA-MiniLM) is
trained only once, dynamically fits any inference scenario, and achieves an
accuracy-efficiency trade-off superior to other efficient approaches at any
computational budget on the SQuAD1.1 dataset (up to an 8.8x speedup with <1%
accuracy loss). The code to reproduce this work is publicly available on
GitHub.
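The snippet below is a minimal, hedged sketch of the low-bit quantization ingredient described in the abstract: it applies PyTorch dynamic INT8 quantization to a MiniLM-style extractive QA model. The checkpoint name is a placeholder (in practice a SQuAD-fine-tuned, length-adaptive MiniLM would be loaded), and the paper's actual recipe (MiniLM distillation, LAT training, and its quantization toolchain) is not reproduced here; refer to the authors' GitHub code for the real implementation.

```python
# Hedged sketch: generic PyTorch dynamic INT8 quantization of a MiniLM-style
# QA model. Not the paper's exact pipeline; the checkpoint is a placeholder.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "microsoft/MiniLM-L12-H384-uncased"  # placeholder, not a SQuAD-tuned LA-MiniLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name).eval()

# Replace every nn.Linear with an INT8-weight, dynamically quantized version.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

question = "What does QuaLA-MiniLM combine?"
context = ("QuaLA-MiniLM combines MiniLM distillation, the Length Adaptive "
           "Transformer technique, and low-bit quantization.")
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = quantized(**inputs)

# Decode the highest-scoring answer span from the start/end logits.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```

Dynamic quantization keeps activations in floating point and quantizes only the Linear weights, which is a convenient stand-in for the low-bit quantization step; the length-adaptive part (dropping tokens layer by layer) additionally requires the LAT-modified model code from the released repository.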
Related papers
- IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models [68.55148272295916]
We propose IntLoRA, to push the efficiency limits by using integer type (INT) low-rank parameters to adapt the quantized diffusion models.
IntLoRA offers three key advantages: (i) for fine-tuning, the pre-trained weights are quantized, reducing memory usage; (ii) for storage, both pre-trained and low-rank weights are in INT which consumes less disk space; (iii) for inference, IntLoRA weights can be naturally merged into quantized pre-trained weights through efficient integer multiplication or bit-shifting.
arXiv Detail & Related papers (2024-10-29T05:50:17Z)
- Transformer Layer Injection: A Novel Approach for Efficient Upscaling of Large Language Models [0.0]
Transformer Layer Injection (TLI) is a novel method for efficiently upscaling large language models (LLMs).
Our approach improves upon the conventional Depth Up-Scaling (DUS) technique by injecting new layers into every set of K layers.
arXiv Detail & Related papers (2024-10-15T14:41:44Z)
- Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z)
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose WTA-CRS, a new family of unbiased estimators for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- MiniALBERT: Model Distillation via Parameter-Efficient Recursive Transformers [12.432191400869002]
MiniALBERT is a technique for converting the knowledge of fully parameterised LMs (such as BERT) into a compact recursive student.
We test our proposed models on a number of general and biomedical NLP tasks to demonstrate their viability and compare them with the state-of-the-art and other existing compact models.
arXiv Detail & Related papers (2022-10-12T17:23:21Z)
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
- TangoBERT: Reducing Inference Cost by using Cascaded Architecture [9.496399437260678]
We present TangoBERT, a cascaded model architecture in which instances are first processed by an efficient but less accurate first tier model.
The decision of whether to apply the second tier model is based on a confidence score produced by the first tier model.
We report TangoBERT inference CPU speedups on four text classification GLUE tasks and on one reading comprehension task; a toy sketch of this confidence-gated cascade appears after this list.
arXiv Detail & Related papers (2022-04-13T09:45:08Z)
- Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic Sequence Length [2.8770761243361593]
TinyBERT addresses computational efficiency by self-distilling BERT into a smaller transformer representation.
Dynamic-TinyBERT is trained only once, performs on par with BERT, and achieves an accuracy-speedup trade-off superior to other efficient approaches.
arXiv Detail & Related papers (2021-11-18T11:58:19Z)
- TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference [54.791572981834435]
Existing pre-trained language models (PLMs) are often computationally expensive in inference.
We propose a dynamic token reduction approach to accelerate PLMs' inference, named TR-BERT.
TR-BERT formulates the token reduction process as a multi-step token selection problem and automatically learns the selection strategy via reinforcement learning.
arXiv Detail & Related papers (2021-05-25T02:28:51Z)
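As a side note on the cascading idea summarized in the TangoBERT entry above, the following toy sketch shows the confidence-gated two-tier pattern: a small model answers first, and a larger model is consulted only when the small model is unsure. The two checkpoints and the 0.9 threshold are illustrative assumptions, not TangoBERT's actual configuration.

```python
# Toy sketch of a confidence-gated two-tier cascade (TangoBERT-style routing).
# Checkpoints and threshold are illustrative; label conventions may differ
# between the two models.
from transformers import pipeline

small = pipeline("sentiment-analysis",
                 model="distilbert-base-uncased-finetuned-sst-2-english")
large = pipeline("sentiment-analysis",
                 model="textattack/bert-base-uncased-SST-2")

def cascaded_classify(text: str, threshold: float = 0.9) -> dict:
    first = small(text)[0]
    if first["score"] >= threshold:  # confident: stop at the cheap first tier
        return first
    return large(text)[0]            # uncertain: escalate to the larger tier

print(cascaded_classify("A thoroughly enjoyable film."))
```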