Quantization-Aware and Tensor-Compressed Training of Transformers for
Natural Language Understanding
- URL: http://arxiv.org/abs/2306.01076v2
- Date: Sat, 8 Jul 2023 04:29:09 GMT
- Title: Quantization-Aware and Tensor-Compressed Training of Transformers for
Natural Language Understanding
- Authors: Zi Yang, Samridhi Choudhary, Siegfried Kunzmann, Zheng Zhang
- Abstract summary: The paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and runtime latency of transformer-based models.
A layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer.
The performance is demonstrated in two natural language understanding tasks, showing up to $63\times$ compression ratio, little accuracy loss and remarkable inference and training speedup.
- Score: 12.030179065286928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuned transformer models have shown superior performance in many
natural language tasks. However, the large model size prohibits deploying
high-performance transformer models on resource-constrained devices. This paper
proposes a quantization-aware tensor-compressed training approach to reduce the
model size, arithmetic operations, and ultimately runtime latency of
transformer-based models. We compress the embedding and linear layers of
transformers into small low-rank tensor cores, which significantly reduces
model parameters. A quantization-aware training with learnable scale factors is
used to further obtain low-precision representations of the tensor-compressed
models. The developed approach can be used for both end-to-end training and
distillation-based training. To improve the convergence, a layer-by-layer
distillation is applied to distill a quantized and tensor-compressed student
model from a pre-trained transformer. The performance is demonstrated in two
natural language understanding tasks, showing up to $63\times$ compression
ratio, little accuracy loss and remarkable inference and training speedup.
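For concreteness, the sketch below illustrates the two ingredients described in the abstract: a linear layer whose weight matrix is stored as small low-rank tensor-train cores, and uniform fake quantization with a learnable scale factor applied during training. This is a minimal PyTorch sketch under assumed shapes and rank choices, not the authors' implementation; the names `FakeQuant` and `QuantTTLinear` and the factorization of a 768x3072 layer are illustrative only.

```python
import torch
import torch.nn as nn
from math import prod


class FakeQuant(nn.Module):
    """Uniform fake quantization with a learnable scale factor.

    Rounding uses a straight-through estimator, so gradients flow to both
    the quantized tensor and the scale during quantization-aware training.
    """
    def __init__(self, bits: int = 8):
        super().__init__()
        self.qmax = 2 ** (bits - 1) - 1
        self.log_scale = nn.Parameter(torch.zeros(()))  # learnable scale (log-domain)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.log_scale.exp()
        x_scaled = x / scale
        # straight-through estimator: round() in the forward pass, identity backward
        x_rounded = x_scaled + (torch.round(x_scaled) - x_scaled).detach()
        return torch.clamp(x_rounded, -self.qmax - 1, self.qmax) * scale


class QuantTTLinear(nn.Module):
    """Linear layer whose weight is stored as a chain of small tensor-train cores."""
    def __init__(self, in_factors, out_factors, ranks, bits: int = 8):
        super().__init__()
        assert len(in_factors) == len(out_factors) == len(ranks) - 1
        self.in_factors, self.out_factors = in_factors, out_factors
        # Core k has shape (ranks[k], in_factors[k], out_factors[k], ranks[k + 1]).
        self.cores = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(ranks[k], in_factors[k],
                                            out_factors[k], ranks[k + 1]))
             for k in range(len(in_factors))]
        )
        self.quant = FakeQuant(bits)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rebuild the full weight from (fake-)quantized cores, then apply it.
        w = self.quant(self.cores[0])                     # (1, I1, O1, r2)
        for core in self.cores[1:]:
            w = torch.einsum('...b,bijc->...ijc', w, self.quant(core))
        w = w.squeeze(0).squeeze(-1)                      # (I1, O1, ..., Id, Od)
        d = len(self.in_factors)
        perm = list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2))
        weight = w.permute(perm).reshape(prod(self.in_factors),
                                         prod(self.out_factors))
        return x @ weight


# Example: a 768 x 3072 feed-forward weight stored as three small cores
# (~37k parameters instead of ~2.4M); the factor and rank choices are illustrative.
layer = QuantTTLinear(in_factors=(12, 8, 8), out_factors=(12, 16, 16),
                      ranks=(1, 16, 16, 1), bits=8)
y = layer(torch.randn(4, 768))   # -> shape (4, 3072)
```

In a layer-by-layer distillation setting, the output of each such compressed layer could be regressed onto the corresponding teacher layer's output (e.g. with an MSE loss) before end-to-end fine-tuning; the exact loss and schedule here are assumptions, not details taken from the paper.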
Related papers
- Exploring Quantization for Efficient Pre-Training of Transformer Language Models [11.696132057489786]
This study aims to explore the impact of quantization for efficient pre-training of Transformers.
By systematically applying straightforward linear quantization to weights, activations, gradients, and states, we assess its effects on model efficiency, stability, and performance during training (a minimal sketch of such linear quantization appears after this list).
arXiv Detail & Related papers (2024-07-16T13:42:09Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms such as low-rank computation achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [6.809572275782338]
We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model.
Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
arXiv Detail & Related papers (2024-03-14T17:59:14Z) - Dynamic Layer Tying for Parameter-Efficient Transformers [65.268245109828]
We employ Reinforcement Learning to select layers during training and tie them together.
This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique.
In particular, the memory consumption during training is up to one order of magnitude lower than that of conventional training.
arXiv Detail & Related papers (2024-01-23T14:53:20Z) - Weight subcloning: direct initialization of transformers using larger
pretrained ones [42.056148990349094]
We introduce a technique to transfer the knowledge of a pretrained model to smaller variants.
Weight subcloning expedites the training of scaled-down transformers by initializing their weights from larger pretrained models.
We achieve 4x faster training for vision transformers on image classification and for language models designed for next-token prediction.
arXiv Detail & Related papers (2023-12-14T19:08:56Z) - USDC: Unified Static and Dynamic Compression for Visual Transformer [17.10536016262485]
Visual Transformers have achieved great success in almost all vision tasks, such as classification and detection.
However, the model complexity and the inference speed of the visual transformers hinder their deployments in industrial products.
Various model compression techniques focus on directly compressing visual transformers into smaller models while maintaining performance; however, performance drops dramatically when the compression ratio is large.
Several dynamic network techniques have also been applied to dynamically compress the visual transformers to obtain input-adaptive efficient sub-structures during the inference stage, which can achieve a better trade-off between the compression ratio and the model performance.
arXiv Detail & Related papers (2023-10-17T10:04:47Z) - Efficient GPT Model Pre-training using Tensor Train Matrix
Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure.
The resulting GPT-based model stores up to 40% fewer parameters, showing perplexity comparable to that of the original model.
arXiv Detail & Related papers (2023-06-05T08:38:25Z) - Modular Transformers: Compressing Transformers into Modularized Layers
for Flexible Efficient Inference [83.01121484432801]
We introduce Modular Transformers, a modularized encoder-decoder framework for flexible sequence-to-sequence model compression.
After a single training phase, Modular Transformers can achieve flexible compression ratios from 1.1x to 6x with little to moderate relative performance drop.
arXiv Detail & Related papers (2023-06-04T15:26:28Z) - Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers by learning to linearly map the parameters of a smaller model to initialize a larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% computational cost of training from scratch.
arXiv Detail & Related papers (2023-03-02T05:21:18Z) - Train Large, Then Compress: Rethinking Model Size for Efficient Training
and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
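As a companion to the "Exploring Quantization for Efficient Pre-Training of Transformer Language Models" entry above, the following is a minimal sketch of what "straightforward linear quantization" applied to weights, activations, gradients, or optimizer states could look like. It is an assumption for illustration, not the cited paper's code; the 8-bit width and the symmetric per-tensor scheme are example choices.

```python
import torch


def linear_quantize(t: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor linear quantization: snap values to a uniform grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().max().clamp(min=1e-8) / qmax        # map the largest magnitude to qmax
    q = torch.clamp(torch.round(t / scale), -qmax - 1, qmax)
    return q * scale                                    # simulated low-precision values


# The same routine can be applied to any training tensor:
w = torch.randn(256, 256)          # weights
g = torch.randn_like(w) * 1e-3     # stand-in for a gradient tensor
m = torch.randn_like(w) * 1e-2     # stand-in for an optimizer state (e.g. momentum)
w_q, g_q, m_q = (linear_quantize(t, bits=8) for t in (w, g, m))
print((w - w_q).abs().max())       # round-off error is at most scale / 2
```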