"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
- URL: http://arxiv.org/abs/2411.02355v1
- Date: Mon, 04 Nov 2024 18:21:59 GMT
- Title: "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
- Authors: Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh
- Abstract summary: We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, as well as for asynchronous deployment on mid-tier GPUs.
- Score: 67.3213104337679
- License:
- Abstract: Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements which allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the "best" format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous "continuous batching" deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.
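Since the inference performance analysis is carried out with the open-source vLLM framework, a minimal sketch of loading and querying a quantized checkpoint through vLLM's offline API is shown below. The model identifier is a placeholder rather than a checkpoint named in the paper; point it at whatever W4A16, W8A8-INT, or W8A8-FP model you have available.

```python
# A minimal sketch of running a quantized model through vLLM's offline API.
# The model identifier below is a placeholder, not a checkpoint from the
# paper; substitute any local or Hub-hosted quantized Llama-3.1 checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/or/hub-id/of-a-quantized-llama-3.1-checkpoint",  # placeholder
    # For on-the-fly W8A8-FP on supported GPUs, vLLM also accepts
    # quantization="fp8" together with an unquantized checkpoint.
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(
    ["Summarize the trade-offs between INT4 and FP8 quantization."], params
)
print(outputs[0].outputs[0].text)
```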
Related papers
- Efficient Post-training Quantization with FP8 Formats [14.543387418837154]
We study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures.
E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks.
arXiv Detail & Related papers (2023-09-26T00:58:36Z)
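The E4M3-versus-E3M4 comparison above comes down to trading exponent range for mantissa precision. Below is a small Python sketch that simulates rounding to a generic ExMy format and compares the two on Gaussian data; it assumes an IEEE-style bias and no reserved exponent codes, so it only approximates the real FP8 specifications.

```python
import numpy as np

def fake_fp_quant(x, exp_bits, man_bits):
    # Simulate rounding to an ExMy floating-point format. Assumes an
    # IEEE-style bias and no reserved exponent codes, and skips subnormal
    # and NaN handling, so it only approximates the real FP8 specs.
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 1) - bias
    min_exp = 1 - bias
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp
    x = np.clip(x, -max_val, max_val)
    exp = np.floor(np.log2(np.maximum(np.abs(x), 2.0 ** min_exp)))
    step = 2.0 ** (exp - man_bits)  # spacing between representable values
    return np.round(x / step) * step

rng = np.random.default_rng(0)
acts = 3.0 * rng.standard_normal(100_000)
for name, (e, m) in {"E4M3": (4, 3), "E3M4": (3, 4)}.items():
    mse = np.mean((acts - fake_fp_quant(acts, e, m)) ** 2)
    print(f"{name}: mean squared rounding error = {mse:.2e}")
```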
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- FPTQ: Fine-grained Post-Training Quantization for Large Language Models [28.11564378745513]
We propose a novel W4A8 post-training quantization method for open-source LLMs.
We obtain the state-of-the-art W4A8 quantized performance on BLOOM, LLaMA, and LLaMA-2 on standard benchmarks.
arXiv Detail & Related papers (2023-08-30T12:18:18Z)
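"Fine-grained" here refers to quantizing at a granularity finer than per-tensor. The sketch below shows generic group-wise symmetric INT4 weight quantization; the group size, function names, and use of NumPy are illustrative choices, not FPTQ's actual W4A8 recipe (which also quantizes activations to 8 bits).

```python
import numpy as np

def quantize_w4_groupwise(w, group_size=128):
    # Symmetric INT4 weight quantization with one scale per contiguous group
    # of `group_size` input channels -- a generic fine-grained scheme, not
    # FPTQ's exact method.
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    w_g = w.reshape(out_features, in_features // group_size, group_size)
    scale = np.abs(w_g).max(axis=-1, keepdims=True) / 7.0  # map group max to 7
    q = np.clip(np.round(w_g / scale), -8, 7).astype(np.int8)
    return q.reshape(w.shape), scale

def dequantize_w4_groupwise(q, scale, group_size=128):
    out_features, in_features = q.shape
    q_g = q.reshape(out_features, in_features // group_size, group_size)
    return (q_g * scale).reshape(out_features, in_features)

w = np.random.default_rng(0).standard_normal((256, 1024)).astype(np.float32)
q, s = quantize_w4_groupwise(w)
w_hat = dequantize_w4_groupwise(q, s)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```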
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats [25.543571445739936]
This study explores the viability of floating-point (FP) quantization for large language models (LLMs).
For LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models with more than one billion parameters.
For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
arXiv Detail & Related papers (2023-07-19T06:58:03Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions, down to 3 bits.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
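The Dense-and-Sparse decomposition in (ii) can be illustrated with a short sketch: pull a small fraction of large-magnitude weights out into a sparse, full-precision set and leave a dense residual for low-bit quantization. The magnitude-based threshold, outlier fraction, and triplet storage below are illustrative assumptions; SqueezeLLM selects sensitive values using second-order information rather than magnitude alone.

```python
import numpy as np

def dense_and_sparse_split(w, outlier_fraction=0.005):
    # Keep the largest-magnitude weights in full precision as sparse
    # (row, col, value) triplets and zero them out of the dense matrix,
    # which is then quantized at low precision. Magnitude-based selection
    # is an assumption here, not SqueezeLLM's sensitivity-based rule.
    threshold = np.quantile(np.abs(w), 1.0 - outlier_fraction)
    mask = np.abs(w) >= threshold
    rows, cols = np.nonzero(mask)
    outliers = (rows, cols, w[mask])
    dense = np.where(mask, 0.0, w)
    return dense, outliers

w = np.random.default_rng(0).standard_normal((512, 512)).astype(np.float32)
dense, (rows, cols, vals) = dense_and_sparse_split(w)
# Check: the original weights are exactly the dense part plus the outliers.
recon = dense.copy()
recon[rows, cols] = vals
print(len(vals), "outliers kept; exact reconstruction:", np.allclose(recon, w))
```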
- Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models [17.055400141733124]
Low-bit integer formats (e.g., INT8/INT4) have been the conventional choice for quantizing large language models (LLMs).
Low-bit floating-point formats (e.g., FP8/FP4) offer a compelling alternative and are gaining support from cutting-edge hardware, such as NVIDIA's H100 GPU.
We propose the Mixture of Formats Quantization (MoFQ), which selects the optimal format on a layer-wise basis.
arXiv Detail & Related papers (2023-05-21T05:28:37Z)
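Layer-wise format selection boils down to measuring, per layer, which format reconstructs the tensor better. The sketch below compares INT8 against FP8 (E4M3) round-trip error per tensor; the MSE decision rule and the use of PyTorch's float8 dtype (available from PyTorch 2.1) are simplifications of my own, not MoFQ's actual algorithm.

```python
import torch  # float8 dtypes require PyTorch >= 2.1

def pick_format(weight):
    # Choose INT8 vs FP8 (E4M3) for one tensor by comparing round-trip error.
    # This per-tensor MSE rule is a simplification, not MoFQ's algorithm.
    scale = weight.abs().max() / 127.0                        # symmetric INT8 scale
    int8_rt = (weight / scale).round().clamp(-128, 127) * scale
    fp8_rt = weight.to(torch.float8_e4m3fn).to(weight.dtype)  # FP8 round trip
    int8_err = (weight - int8_rt).pow(2).mean()
    fp8_err = (weight - fp8_rt).pow(2).mean()
    return "int8" if int8_err <= fp8_err else "fp8_e4m3"

for name, w in {f"layer_{i}": torch.randn(256, 256) for i in range(4)}.items():
    print(name, "->", pick_format(w))
```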
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
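The lookup-table trick replaces multiply-accumulates with precomputed partial sums indexed by weight bit patterns. Below is a toy NumPy matrix-vector version for {-1, +1} weights with a group size of 8; the grouping, bit packing, and binary weights are illustrative assumptions and say nothing about DeepGEMM's actual SIMD kernels.

```python
import numpy as np

def lut_matvec_binary(w_bits, x, group=8):
    # Matrix-vector product with {-1, +1} weights using per-group lookup
    # tables instead of multiply-accumulates. A toy illustration of the
    # LUT idea, not DeepGEMM's packed low-precision SIMD kernels.
    out_dim, in_dim = w_bits.shape
    assert in_dim % group == 0
    n_groups = in_dim // group
    # Partial sums for every possible weight bit pattern within a group.
    patterns = ((np.arange(2 ** group)[:, None] >> np.arange(group)) & 1) * 2 - 1
    luts = patterns @ x.reshape(n_groups, group).T          # (2**group, n_groups)
    # Pack each weight group's bits into an index into its lookup table.
    idx = (w_bits.reshape(out_dim, n_groups, group) * (1 << np.arange(group))).sum(-1)
    return luts[idx, np.arange(n_groups)].sum(-1)

rng = np.random.default_rng(0)
w_bits = rng.integers(0, 2, size=(16, 64))                  # 1 -> +1, 0 -> -1
x = rng.standard_normal(64)
print(np.allclose((w_bits * 2 - 1) @ x, lut_matvec_binary(w_bits, x)))
```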
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers [29.566132632781848]
We present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed ZeroQuant.
ZeroQuant is an end-to-end quantization and inference pipeline with three main components.
arXiv Detail & Related papers (2022-06-04T00:28:21Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often carry the hidden cost of converting back and forth between floating-point and quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
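"Dyadic" means every rescaling factor is expressed as an integer divided by a power of two, so inference never touches floating point. A minimal sketch of that approximation follows; the signed 32-bit multiplier width and the rounding rule are assumptions of mine, not HAWQV3's exact procedure.

```python
def dyadic_approx(scale, bits=32):
    # Approximate a positive rescale factor as b / 2**c with b an integer
    # fitting in a signed `bits`-bit word, so requantization becomes an
    # integer multiply plus a right shift. Illustrative sketch only.
    assert scale > 0
    c = 0
    while scale * 2 ** (c + 1) < 2 ** (bits - 1) - 1:
        c += 1
    return round(scale * 2 ** c), c

# Example: replace s = input_scale * weight_scale / output_scale by b / 2**c;
# an INT32 accumulator acc is then requantized as
#   out = clamp((acc * b) >> c, -128, 127)   with the product held in 64 bits.
s = 0.00123
b, c = dyadic_approx(s)
print(f"{s} ~ {b} / 2**{c} = {b / 2 ** c}")
```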