Extremely Low Bit Transformer Quantization for On-Device Neural Machine
Translation
- URL: http://arxiv.org/abs/2009.07453v2
- Date: Tue, 13 Oct 2020 05:23:31 GMT
- Title: Extremely Low Bit Transformer Quantization for On-Device Neural Machine
Translation
- Authors: Insoo Chung, Byeongwook Kim, Yoonjung Choi, Se Jung Kwon, Yongkweon
Jeon, Baeseong Park, Sangha Kim and Dongsoo Lee
- Abstract summary: We propose a mixed precision quantization strategy to represent Transformer weights by an extremely low number of bits.
Our model achieves an 11.8$\times$ smaller model size than the baseline model, with less than 0.5 BLEU degradation.
We achieve an 8.3$\times$ reduction in run-time memory footprint and a 3.5$\times$ speed-up.
- Score: 9.770173256808844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The deployment of widely used Transformer architecture is challenging because
of heavy computation load and memory overhead during inference, especially when
the target device is limited in computational resources such as mobile or edge
devices. Quantization is an effective technique to address such challenges. Our
analysis shows that for a given number of quantization bits, each block of
Transformer contributes to translation quality and inference computations in
different manners. Moreover, even inside an embedding block, each word presents
vastly different contributions. Correspondingly, we propose a mixed precision
quantization strategy to represent Transformer weights by an extremely low
number of bits (e.g., under 3 bits). For example, for each word in an embedding
block, we assign different quantization bits based on its statistical properties. Our
quantized Transformer model achieves an 11.8$\times$ smaller model size than the
baseline model, with less than 0.5 BLEU degradation. We achieve an 8.3$\times$
reduction in run-time memory footprint and a 3.5$\times$ speed-up (on a Galaxy
N10+), such that our proposed compression strategy enables efficient
implementation for on-device NMT.
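The per-word bit assignment described above can be sketched in code. This is an illustrative sketch only, not the paper's exact method: the frequency-based bit rule and the plain min-max uniform quantizer below are stand-ins for the statistics-based mixed-precision scheme the abstract describes, and all names (`quantize_rows`, `word_freq`, `bits`) are hypothetical.

```python
import numpy as np

def quantize_rows(emb, bits):
    """Uniformly quantize each embedding row to its assigned bit-width.

    Illustrative stand-in: simple per-row min-max uniform quantization,
    with a different bit-width per word (row), mimicking the idea of
    mixed-precision embedding quantization.
    """
    out = np.empty_like(emb)
    for i, b in enumerate(bits):
        levels = 2 ** int(b) - 1                 # number of quantization steps
        lo, hi = emb[i].min(), emb[i].max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        q = np.round((emb[i] - lo) / scale)      # integer codes in [0, levels]
        out[i] = q * scale + lo                  # dequantized values
    return out

# Hypothetical setup: frequent words get 3 bits, rare words 1 bit,
# so the average bit-width stays well under 3.
rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 64)).astype(np.float32)
word_freq = rng.zipf(1.5, size=1000)             # stand-in word frequencies
bits = np.where(word_freq > np.median(word_freq), 3, 1)
emb_q = quantize_rows(emb, bits)
print("avg bits/word:", bits.mean())
```

Storing the integer codes `q` plus one `(lo, scale)` pair per row, rather than the dequantized floats, is what yields the actual size reduction.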
Related papers
- FrameQuant: Flexible Low-Bit Quantization for Transformers [27.93241211038938]
Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower.
We show that (almost) two-bit quantization for Transformer models promises sizable efficiency gains.
arXiv Detail & Related papers (2024-03-10T04:01:49Z)
- BitNet: Scaling 1-bit Transformers for Large Language Models [119.18692348616845]
We introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models.
Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption.
arXiv Detail & Related papers (2023-10-17T17:59:15Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- MBQuant: A Novel Multi-Branch Topology Method for Arbitrary Bit-width Network Quantization [51.85834744835766]
We propose MBQuant, a novel method for arbitrary bit-width quantization.
We show that MBQuant achieves significant performance gains compared to existing arbitrary bit-width quantization methods.
arXiv Detail & Related papers (2023-05-14T10:17:09Z)
- Scaled Quantization for the Vision Transformer [0.0]
Quantization using a small number of bits shows promise for reducing latency and memory usage in deep neural networks.
This paper proposes a robust method for the full integer quantization of vision transformer networks without requiring any intermediate floating-point computations.
arXiv Detail & Related papers (2023-03-23T18:31:21Z)
- Binarized Neural Machine Translation [43.488431560851204]
We propose a novel binarization technique for Transformers applied to machine translation (BMT).
We identify and address the problem of inflated dot-product variance when using one-bit weights and activations.
Experiments on the WMT dataset show that a one-bit weight-only Transformer can achieve the same quality as a float one, while being 16x smaller in size.
arXiv Detail & Related papers (2023-02-09T19:27:34Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent, but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization [17.05322956052278]
Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks.
However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices.
We show that transformers have unique quantization challenges -- namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format.
arXiv Detail & Related papers (2021-09-27T10:57:18Z)
- Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator.
We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
arXiv Detail & Related papers (2021-04-20T14:14:03Z)
- Towards Fully 8-bit Integer Inference for the Transformer Model [39.22272841663168]
We show that after a principled modification of the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm can be derived.
Our experiments on WMT16 En->Ro, WMT14 En->De and En->Fr translation tasks as well as the WikiText-103 language modelling task show that the fully 8-bit Transformer system achieves comparable performance with the floating point baseline but requires nearly 4x less memory footprint.
arXiv Detail & Related papers (2020-09-17T03:09:10Z)
- Powers of layers for image-to-image translation [60.5529622990682]
We propose a simple architecture to address unpaired image-to-image translation tasks.
We start from an image autoencoder architecture with fixed weights.
For each task we learn a residual block operating in the latent space, which is iteratively called until the target domain is reached.
arXiv Detail & Related papers (2020-08-13T09:02:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.