Related papers: QuaLA-MiniLM: a Quantized Length Adaptive MiniLM

QuaLA-MiniLM: a Quantized Length Adaptive MiniLM

URL: http://arxiv.org/abs/2210.17114v3
Date: Wed, 10 May 2023 12:57:33 GMT
Title: QuaLA-MiniLM: a Quantized Length Adaptive MiniLM
Authors: Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen
Abstract summary: Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. A knowledge distillation approach addresses the computational efficiency by self-distilling BERT into a smaller transformer representation having fewer layers and smaller internal embedding. Dynamic-TinyBERT tackles both limitations by partially implementing the Length Adaptive Transformer (LAT) technique onto TinyBERT, achieving x3 speedup over BERT-base with minimal accuracy loss. We use MiniLM distillation jointly with the LAT method, and we further enhance the efficiency by applying low-bit quantization.
Score: 5.36703735486629
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. A knowledge distillation approach addresses the computational efficiency by self-distilling BERT into a smaller transformer representation having fewer layers and smaller internal embedding. However, the performance of these models drops as we reduce the number of layers, notably in advanced NLP tasks such as span question answering. In addition, a separate model must be trained for each inference scenario with its distinct computational budget. Dynamic-TinyBERT tackles both limitations by partially implementing the Length Adaptive Transformer (LAT) technique onto TinyBERT, achieving x3 speedup over BERT-base with minimal accuracy loss. In this work, we expand the Dynamic-TinyBERT approach to generate a much more highly efficient model. We use MiniLM distillation jointly with the LAT method, and we further enhance the efficiency by applying low-bit quantization. Our quantized length-adaptive MiniLM model (QuaLA-MiniLM) is trained only once, dynamically fits any inference scenario, and achieves an accuracy-efficiency trade-off superior to any other efficient approaches per any computational budget on the SQuAD1.1 dataset (up to x8.8 speedup with <1% accuracy loss). The code to reproduce this work is publicly available on Github.

Related papers

Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of softmax operation. Experiments demonstrate that SWAT achieves SOTA performance compared with state-of-the-art linear recurrent architectures on eight benchmarks.
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models [68.55148272295916]
We propose IntLoRA, to push the efficiency limits by using integer type (INT) low-rank parameters to adapt the quantized diffusion models. IntLoRA offers three key advantages: (i) for fine-tuning, the pre-trained weights are quantized, reducing memory usage; (ii) for storage, both pre-trained and low-rank weights are in INT which consumes less disk space; (iii) for inference, IntLoRA weights can be naturally merged into quantized pre-trained weights through efficient integer multiplication or bit-shifting.
arXiv Detail & Related papers (2024-10-29T05:50:17Z)
Transformer Layer Injection: A Novel Approach for Efficient Upscaling of Large Language Models [0.0]
Transformer Layer Injection (TLI) is a novel method for efficiently upscaling large language models (LLMs) Our approach improves upon the conventional Depth Up-Scaling (DUS) technique by injecting new layers into every set of K layers.
arXiv Detail & Related papers (2024-10-15T14:41:44Z)
Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and efficient transfer learning (PCETL) for vision-language pre-trained models. We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z)
FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment. We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs. We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z)
Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks. We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
MiniALBERT: Model Distillation via Parameter-Efficient Recursive Transformers [12.432191400869002]
MiniALBERT is a technique for converting the knowledge of fully parameterised LMs (such as BERT) into a compact recursive student. We test our proposed models on a number of general and biomedical NLP tasks to demonstrate their viability and compare them with the state-of-the-art and other existing compact models.
arXiv Detail & Related papers (2022-10-12T17:23:21Z)
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made. parameter-efficient fine-tuning offers an alternative paradigm where a small set of parameters are trained to enable a model to perform the new task. In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
TangoBERT: Reducing Inference Cost by using Cascaded Architecture [9.496399437260678]
We present TangoBERT, a cascaded model architecture in which instances are first processed by an efficient but less accurate first tier model. The decision of whether to apply the second tier model is based on a confidence score produced by the first tier model. We report TangoBERT inference CPU speedup on four text classification GLUE tasks and on one reading comprehension task.
arXiv Detail & Related papers (2022-04-13T09:45:08Z)
Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic Sequence Length [2.8770761243361593]
TinyBERT addresses the computational efficiency by self-distilling BERT into a smaller transformer representation. Dynamic-TinyBERT is trained only once, performing on-par with BERT and achieving an accuracy-speedup trade-off superior to any other efficient approaches.
arXiv Detail & Related papers (2021-11-18T11:58:19Z)
TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference [54.791572981834435]
Existing pre-trained language models (PLMs) are often computationally expensive in inference. We propose a dynamic token reduction approach to accelerate PLMs' inference, named TR-BERT. TR-BERT formulates the token reduction process as a multi-step token selection problem and automatically learns the selection strategy via reinforcement learning.
arXiv Detail & Related papers (2021-05-25T02:28:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.