GPTQ: Accurate Post-Training Quantization for Generative Pre-trained
Transformers
- URL: http://arxiv.org/abs/2210.17323v2
- Date: Wed, 22 Mar 2023 13:10:47 GMT
- Title: GPTQ: Accurate Post-Training Quantization for Generative Pre-trained
Transformers
- Authors: Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
- Abstract summary: GPTQ is a new one-shot weight quantization method based on approximate second-order information.
It can quantize GPT models with 175 billion parameters in approximately four GPU hours.
Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods.
- Score: 34.91478831993398
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative Pre-trained Transformer models, known as GPT or OPT, set
themselves apart through breakthrough performance across complex language
modelling tasks, but also by their extremely high computational and storage
costs. Specifically, due to their massive size, even inference for large,
highly-accurate GPT models may require multiple performant GPUs, which limits
the usability of such models. While there is emerging work on relieving this
pressure via model compression, the applicability and performance of existing
compression techniques is limited by the scale and complexity of GPT models. In
this paper, we address this challenge, and propose GPTQ, a new one-shot weight
quantization method based on approximate second-order information, that is both
highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT
models with 175 billion parameters in approximately four GPU hours, reducing
the bitwidth down to 3 or 4 bits per weight, with negligible accuracy
degradation relative to the uncompressed baseline. Our method more than doubles
the compression gains relative to previously-proposed one-shot quantization
methods, preserving accuracy, allowing us for the first time to execute an 175
billion-parameter model inside a single GPU for generative inference. Moreover,
we also show that our method can still provide reasonable accuracy in the
extreme quantization regime, in which weights are quantized to 2-bit or even
ternary quantization levels. We show experimentally that these improvements can
be leveraged for end-to-end inference speedups over FP16 of around 3.25x when
using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones
(NVIDIA A6000). The implementation is available at
https://github.com/IST-DASLab/gptq.
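To make the core idea concrete, here is a minimal, illustrative sketch of layer-wise quantization with second-order error compensation in the spirit of GPTQ. This is not the authors' implementation: the plain matrix inverse, the per-column round-to-nearest quantizer, and all names and shapes are simplifications assumed for a small, runnable demo.

    import torch

    def quantize_rtn(w, bits=4):
        # Symmetric round-to-nearest quantization of a 1-D weight vector.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    def gptq_like_quantize(W, X, bits=4, damp=0.01):
        # W: (rows, cols) layer weight; X: (cols, n_samples) calibration inputs.
        # H approximates the second-order (Hessian) information of the layer's
        # squared output error w.r.t. its weights; dampening keeps its inverse
        # well conditioned.
        H = 2 * X @ X.T
        H += damp * torch.diag(H).mean() * torch.eye(H.shape[0])
        Hinv = torch.linalg.inv(H)
        W = W.clone()
        Q = torch.zeros_like(W)
        for j in range(W.shape[1]):
            q = quantize_rtn(W[:, j], bits)
            Q[:, j] = q
            # Spread the quantization error of column j onto the columns that
            # have not been quantized yet (second-order compensation).
            err = (W[:, j] - q) / Hinv[j, j]
            W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
        return Q

    torch.manual_seed(0)
    W = torch.randn(8, 16)
    X = torch.randn(16, 128)
    Q = gptq_like_quantize(W, X, bits=4)
    print("layer output MSE after quantization:", ((W @ X - Q @ X) ** 2).mean().item())

The released implementation further uses a Cholesky reformulation of the inverse Hessian, blocked ("lazy batch") column updates, and optional grouped quantization scales, all of which this sketch omits for brevity.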
Related papers
- GPTQT: Quantize Large Language Models Twice to Push the Efficiency [1.3149617027696827]
This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed.
Practice has shown that solely minimizing the quantization error of the weights is ineffective and leads to overfitting.
GPTQT therefore employs a progressive two-step approach: weights are first quantized with linear quantization to a relatively high bit-width, and the resulting integer weights are then converted to a lower-bit binary coding.
arXiv Detail & Related papers (2024-07-03T08:08:01Z)
- decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points [10.238677144792279]
decoupleQ abandons the traditional quantization paradigm and decouples the model parameters into integer and floating-point parts.
Our method achieves online accuracy close to fp16/bf16 for 2-bit quantization of large speech models at ByteDance.
arXiv Detail & Related papers (2024-04-19T10:02:53Z)
- QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models [64.34635279436054]
Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, but their enormous parameter counts incur very high memory costs.
We present a solution to this memory problem, in the form of a new compression and execution framework called QMoE.
arXiv Detail & Related papers (2023-10-25T17:24:53Z)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit precision.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin for current post-training quantization (PTQ) methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format (a minimal sketch of such a decomposition appears after this list).
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
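Referenced from the SqueezeLLM entry above: a minimal, illustrative sketch of a dense-and-sparse decomposition, in which the largest-magnitude (outlier) weights are kept in a sparse full-precision matrix and the remaining dense part is left for low-bit quantization. The thresholding rule, the outlier fraction, and the use of PyTorch sparse COO storage are assumptions for the demo, not the SqueezeLLM implementation.

    import torch

    def dense_and_sparse_split(W, outlier_frac=0.005):
        # Treat the largest-magnitude weights as outliers.
        k = max(1, int(W.numel() * outlier_frac))
        thresh = W.abs().flatten().topk(k).values.min()
        outlier_mask = W.abs() >= thresh
        sparse_part = (W * outlier_mask).to_sparse()  # outliers, kept in full precision
        dense_part = W * (~outlier_mask)              # remainder, eligible for low-bit quantization
        return dense_part, sparse_part

    torch.manual_seed(0)
    W = torch.randn(256, 256)
    dense, sparse = dense_and_sparse_split(W)
    # The original weights are recovered exactly by summing the two parts.
    assert torch.allclose(dense + sparse.to_dense(), W)
    print("outliers stored sparsely:", sparse.values().numel())

At inference time, the quantized dense part would be dequantized and the sparse outliers added back (or applied as a separate sparse multiplication), which is what keeps the accuracy impact of the outliers small.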