Extremely Low Bit Transformer Quantization for On-Device Neural Machine
Translation
- URL: http://arxiv.org/abs/2009.07453v2
- Date: Tue, 13 Oct 2020 05:23:31 GMT
- Title: Extremely Low Bit Transformer Quantization for On-Device Neural Machine
Translation
- Authors: Insoo Chung, Byeongwook Kim, Yoonjung Choi, Se Jung Kwon, Yongkweon
Jeon, Baeseong Park, Sangha Kim and Dongsoo Lee
- Abstract summary: We propose a mixed precision quantization strategy to represent Transformer weights by an extremely low number of bits.
Our model achieves an 11.8$\times$ smaller model size than the baseline model, with less than a 0.5 BLEU drop.
We achieve an 8.3$\times$ reduction in run-time memory footprint and a 3.5$\times$ speed-up.
- Score: 9.770173256808844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The deployment of widely used Transformer architecture is challenging because
of heavy computation load and memory overhead during inference, especially when
the target device is limited in computational resources such as mobile or edge
devices. Quantization is an effective technique to address such challenges. Our
analysis shows that for a given number of quantization bits, each block of
Transformer contributes to translation quality and inference computations in
different manners. Moreover, even inside an embedding block, each word presents
vastly different contributions. Correspondingly, we propose a mixed precision
quantization strategy to represent Transformer weights by an extremely low
number of bits (e.g., under 3 bits). For example, for each word in an embedding
block, we assign different quantization bits based on statistical property. Our
quantized Transformer model achieves 11.8$\times$ smaller model size than the
baseline model, with less than -0.5 BLEU. We achieve 8.3$\times$ reduction in
run-time memory footprints and 3.5$\times$ speed up (Galaxy N10+) such that our
proposed compression strategy enables efficient implementation for on-device
NMT.
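To make the per-word bit assignment described in the abstract concrete, the snippet below is a minimal sketch rather than the paper's exact method: it assumes bits are assigned to each embedding row from a simple statistic (token frequency, a hypothetical choice here) and then applies plain uniform per-row quantization; the paper's actual statistical criterion and quantizer may differ.

```python
import numpy as np

def assign_bits(token_freq, budgets=(3, 2, 1), cutoffs=(0.9, 0.99)):
    """Hypothetical rule: words covering the top 90% of the frequency mass
    get 3 bits, the next 9% get 2 bits, and the rare tail gets 1 bit."""
    order = np.argsort(-token_freq)                       # most frequent first
    cum = np.cumsum(token_freq[order]) / token_freq.sum()
    bits = np.empty(len(token_freq), dtype=int)
    bits[order] = np.where(cum <= cutoffs[0], budgets[0],
                  np.where(cum <= cutoffs[1], budgets[1], budgets[2]))
    return bits

def quantize_rows(emb, bits):
    """Uniformly quantize each embedding row to its assigned bit width and
    return the dequantized matrix, so the reconstruction error can be inspected."""
    deq = np.empty_like(emb)
    for i, b in enumerate(bits):
        levels = 2 ** int(b)
        lo, hi = emb[i].min(), emb[i].max()
        scale = max((hi - lo) / (levels - 1), 1e-12)
        codes = np.clip(np.round((emb[i] - lo) / scale), 0, levels - 1)
        deq[i] = codes * scale + lo
    return deq

# Toy usage: a 1000-word vocabulary with Zipf-like frequencies.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64)).astype(np.float32)
freq = rng.zipf(1.3, size=1000).astype(np.float64)
bits = assign_bits(freq)
print("avg bits/word:", bits.mean())
print("mean abs reconstruction error:", np.abs(emb - quantize_rows(emb, bits)).mean())
```

With a skewed frequency distribution, most rows end up at 1 bit while the few frequent words keep 3 bits, which is how the average bit width stays well under 3.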
Related papers
- BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments [53.71158537264695]
Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices.
We introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance.
arXiv Detail & Related papers (2024-10-31T13:26:11Z) - Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z) - FrameQuant: Flexible Low-Bit Quantization for Transformers [25.569106620123346]
Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks.
Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower.
We show, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains.
arXiv Detail & Related papers (2024-03-10T04:01:49Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format (a simplified sketch of this decomposition appears after this list).
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Scaled Quantization for the Vision Transformer [0.0]
Quantization using a small number of bits shows promise for reducing latency and memory usage in deep neural networks.
This paper proposes a robust method for the full integer quantization of vision transformer networks without requiring any intermediate floating-point computations.
arXiv Detail & Related papers (2023-03-23T18:31:21Z) - Binarized Neural Machine Translation [43.488431560851204]
We propose a novel binarization technique for Transformers applied to machine translation (BMT).
We identify and address the problem of inflated dot-product variance when using one-bit weights and activations.
Experiments on the WMT dataset show that a one-bit weight-only Transformer can achieve the same quality as a float one, while being 16x smaller in size.
arXiv Detail & Related papers (2023-02-09T19:27:34Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states (a simplified block-wise quantization sketch appears after this list).
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - Understanding and Overcoming the Challenges of Efficient Transformer
Quantization [17.05322956052278]
Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks.
However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices.
We show that transformers have unique quantization challenges -- namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format.
arXiv Detail & Related papers (2021-09-27T10:57:18Z) - Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator (a minimal sketch of this idea appears after this list).
We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
arXiv Detail & Related papers (2021-04-20T14:14:03Z) - Towards Fully 8-bit Integer Inference for the Transformer Model [39.22272841663168]
We show that after a principled modification to the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm can be derived.
Our experiments on the WMT16 En->Ro, WMT14 En->De and En->Fr translation tasks, as well as the WikiText-103 language modelling task, show that the fully 8-bit Transformer system achieves performance comparable to the floating-point baseline while requiring a nearly 4x smaller memory footprint.
arXiv Detail & Related papers (2020-09-17T03:09:10Z) - Powers of layers for image-to-image translation [60.5529622990682]
We propose a simple architecture to address unpaired image-to-image translation tasks.
We start from an image autoencoder architecture with fixed weights.
For each task we learn a residual block operating in the latent space, which is iteratively called until the target domain is reached.
arXiv Detail & Related papers (2020-08-13T09:02:17Z)
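For the SqueezeLLM entry above, the Dense-and-Sparse decomposition can be sketched as follows. This is a simplified illustration under the assumption that "outliers" are simply the largest-magnitude weights and that the dense remainder is quantized uniformly; SqueezeLLM itself uses sensitivity-based non-uniform quantization driven by second-order information, which is not reproduced here.

```python
import numpy as np

def dense_and_sparse_split(w, outlier_frac=0.005, bits=3):
    """Keep the largest-magnitude weights in full precision as a sparse part
    and uniformly quantize the dense remainder to `bits` bits (simplified)."""
    thresh = np.quantile(np.abs(w), 1.0 - outlier_frac)
    mask = np.abs(w) > thresh
    sparse_idx = np.nonzero(mask)             # positions of full-precision outliers
    sparse_val = w[sparse_idx]

    dense = np.where(mask, 0.0, w)            # outliers removed from the dense path
    levels = 2 ** bits
    lo, hi = dense.min(), dense.max()
    scale = (hi - lo) / (levels - 1)
    codes = np.clip(np.round((dense - lo) / scale), 0, levels - 1)
    dense_deq = np.where(mask, 0.0, codes * scale + lo)
    return dense_deq, sparse_idx, sparse_val

# Reconstruction: quantized dense part plus full-precision sparse outliers.
w = np.random.default_rng(1).normal(size=(256, 256)).astype(np.float32)
dense_deq, idx, val = dense_and_sparse_split(w)
approx = dense_deq.copy()
approx[idx] = val
print("max abs error:", np.abs(w - approx).max())
```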
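For the 8-bit optimizers entry, the block-wise idea is sketched below under a simplifying assumption: each block of the optimizer state is quantized with plain per-block absmax scaling to int8, whereas the paper uses dynamic, quantile-based quantization maps.

```python
import numpy as np

def blockwise_quantize_int8(state, block_size=2048):
    """Quantize an optimizer-state tensor block by block: every block gets its
    own absmax scale and is rounded to int8 (simplified absmax scheme)."""
    flat = state.ravel().astype(np.float32)
    pad = (-flat.size) % block_size                   # pad so blocks divide evenly
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.float32)])
    blocks = flat.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    absmax[absmax == 0] = 1.0                         # avoid division by zero
    q = np.round(blocks / absmax * 127).astype(np.int8)
    return q, absmax, state.shape, pad

def blockwise_dequantize(q, absmax, shape, pad):
    flat = (q.astype(np.float32) / 127.0 * absmax).ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

# Toy usage on something shaped like Adam's first-moment buffer.
m = np.random.default_rng(2).normal(size=(4096, 512)).astype(np.float32)
q, s, shape, pad = blockwise_quantize_int8(m)
print("mean abs error:", np.abs(m - blockwise_dequantize(q, s, shape, pad)).mean())
```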
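Finally, for the pseudo quantization noise entry, a minimal sketch of the idea as summarized above: during training, additive noise on the scale of one quantization step stands in for the non-differentiable rounding operator. The step-size formula and uniform noise below are assumptions for illustration; the paper's exact noise model may differ.

```python
import numpy as np

def pseudo_quantization_noise(w, bits=4, rng=None):
    """Return weights perturbed by independent uniform noise whose width equals
    one quantization step, approximating the effect of rounding to `bits` bits."""
    rng = rng or np.random.default_rng()
    step = (w.max() - w.min()) / (2 ** bits - 1)      # assumed uniform-quantizer step
    noise = rng.uniform(-0.5, 0.5, size=w.shape) * step
    return w + noise                                  # used in the forward pass at train time

# During training, the forward pass would use the noisy weights, while the
# stored full-precision weights receive the gradient updates.
w = np.random.default_rng(3).normal(size=(64, 64)).astype(np.float32)
w_noisy = pseudo_quantization_noise(w, bits=4)
print("mean perturbation:", np.abs(w_noisy - w).mean())
```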
This list is automatically generated from the titles and abstracts of the papers in this site.