LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- URL: http://arxiv.org/abs/2208.07339v1
- Date: Mon, 15 Aug 2022 17:08:50 GMT
- Title: LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- Authors: Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer
- Abstract summary: We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
- Score: 80.86029795281922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have been widely adopted but require significant GPU
memory for inference. We develop a procedure for Int8 matrix multiplication for
feed-forward and attention projection layers in transformers, which cuts the
memory needed for inference by half while retaining full precision performance.
With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted
to Int8, and used immediately without performance degradation. This is made
possible by understanding and working around properties of highly systematic
emergent features in transformer language models that dominate attention and
transformer predictive performance. To cope with these features, we develop a
two-part quantization procedure, LLM.int8(). We first use vector-wise
quantization with separate normalization constants for each inner product in
the matrix multiplication, to quantize most of the features. However, for the
emergent outliers, we also include a new mixed-precision decomposition scheme,
which isolates the outlier feature dimensions into a 16-bit matrix
multiplication, while more than 99.9% of values are still multiplied in 8-bit.
Using LLM.int8(), we show empirically it is possible to perform inference in
LLMs with up to 175B parameters without any performance degradation. This
result makes such models much more accessible, for example making it possible
to use OPT-175B/BLOOM on a single server with consumer GPUs.
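A minimal sketch of the two-part procedure described above may make it concrete. The NumPy code below uses vector-wise quantization, with one absmax scale per row of the activations and per column of the weights (a separate normalization constant for each inner product), plus a mixed-precision decomposition that routes the outlier feature dimensions through a high-precision matmul while everything else stays in Int8. The function names and the outlier cutoff are illustrative assumptions, not the authors' released implementation (which ships as CUDA kernels in the bitsandbytes library).
```python
import numpy as np

OUTLIER_THRESHOLD = 6.0  # assumed magnitude cutoff for the emergent outlier dimensions

def absmax_quantize(x, axis):
    """Symmetric Int8 quantization with a separate scale along the given axis."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def llm_int8_matmul(X, W):
    """X: (tokens, features) activations; W: (features, out) weights, both float."""
    # Mixed-precision decomposition: feature dimensions with outlier magnitudes
    # stay on the high-precision path (16-bit in the paper, plain float here).
    outlier = np.abs(X).max(axis=0) >= OUTLIER_THRESHOLD
    regular = ~outlier
    out_high = X[:, outlier] @ W[outlier, :]

    # Vector-wise quantization: one scale per row of X and per column of W,
    # i.e. a separate normalization constant for every inner product.
    Xq, sx = absmax_quantize(X[:, regular], axis=1)   # sx: (tokens, 1)
    Wq, sw = absmax_quantize(W[regular, :], axis=0)   # sw: (1, out)

    # Int8 products accumulated in a wide integer type (the real kernels use
    # Int32), then dequantized with the outer product of the two scale vectors.
    out_int8 = (Xq.astype(np.int64) @ Wq.astype(np.int64)) * (sx * sw)

    return out_high + out_int8
```
As a usage note, the Hugging Face transformers integration of this method has been exposed through 8-bit loading flags (e.g. `load_in_8bit=True` in `from_pretrained`; exact argument names depend on the library version), which matches the load-then-convert workflow the abstract describes.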
Related papers
- ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models [9.444063879246242]
We introduce a novel arbitrary-bit quantization algorithm and inference framework, ABQ-LLM.
It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU.
arXiv Detail & Related papers (2024-08-16T06:39:08Z)
- Scalable MatMul-free Language Modeling [8.672867887354977]
We show that MatMul operations can be completely eliminated from large language models.
Our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers.
arXiv Detail & Related papers (2024-06-04T17:50:34Z)
- OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
- LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [66.85589263870702]
Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component.
Experiments on finetuning RoBERTa and LLaMA-2 demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines.
arXiv Detail & Related papers (2023-11-20T18:57:41Z)
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without any performance degradation, at a 15% speedup.
arXiv Detail & Related papers (2023-06-05T17:53:28Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states. (A minimal sketch of block-wise state quantization follows after this related-papers list.)
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- Towards Fully 8-bit Integer Inference for the Transformer Model [39.22272841663168]
We show that after a principled modification on the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm could be derived.
Our experiments on WMT16 En->Ro, WMT14 En->De and En->Fr translation tasks as well as the WikiText-103 language modelling task show that the fully 8-bit Transformer system achieves performance comparable to the floating point baseline while requiring a nearly 4x smaller memory footprint.
arXiv Detail & Related papers (2020-09-17T03:09:10Z)
- Learning Accurate Integer Transformer Machine-Translation Models [0.05184427980355132]
We describe a method for training accurate Transformer machine-translation models to run inference using 8-bit integer (INT8) hardware matrix multipliers.
Our approach converts all matrix-multiplication tensors from an existing FP32 model into INT8 tensors by automatically making range-precision trade-offs during training.
arXiv Detail & Related papers (2020-01-03T18:40:35Z)
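As referenced in the 8-bit optimizers entry above, block-wise quantization of optimizer state can also be sketched briefly. The code below stores a flat fp32 state tensor (e.g. Adam's squared-gradient sum) as Int8 blocks with one absmax scale per block and dequantizes it before the update; the block size, the linear absmax mapping, and the function names are assumptions for illustration rather than that paper's actual scheme.
```python
import numpy as np

BLOCK_SIZE = 2048  # assumed number of state values sharing one scale

def blockwise_quantize(state):
    """Quantize an fp32 optimizer-state tensor to Int8, one scale per block."""
    flat = state.ravel().astype(np.float32)
    pad = (-flat.size) % BLOCK_SIZE
    blocks = np.pad(flat, (0, pad)).reshape(-1, BLOCK_SIZE)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales, state.shape

def blockwise_dequantize(q, scales, shape):
    """Recover an approximate fp32 state tensor before the optimizer update."""
    flat = (q.astype(np.float32) * scales).ravel()
    return flat[: int(np.prod(shape))].reshape(shape)
```
Under these assumptions, N float32 state values (4N bytes) shrink to roughly N bytes of Int8 plus one float32 scale per block, memory that can instead be spent on model parameters.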