Related papers: INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

URL: http://arxiv.org/abs/2510.25602v1
Date: Wed, 29 Oct 2025 15:11:53 GMT
Title: INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Authors: Mengzhao Chen, Meng Wu, Hui Jin, Zhihang Yuan, Jing Liu, Chaoyi Zhang, Yunshui Li, Jie Huang, Jin Ma, Zeyue Xue, Zhiheng Liu, Xingyan Bin, Ping Luo
Abstract summary: Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats. This paper systematically investigates the trade-offs between FP and integer (INT) formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced.
Score: 51.72056104795248
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
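
The block-wise INT8 quantization and symmetric clipping mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes a shared power-of-two scale per block of 32 values, derived from the floor of log2 of the block's largest magnitude, and INT8 codes with an implicit 2**-6 scale, following the usual microscaling (MX) convention. The names `quantize_block`, `dequantize_block`, `BLOCK_SIZE`, and `FRAC_BITS` are hypothetical.

```python
import math

BLOCK_SIZE = 32  # block size the abstract names for MX formats
FRAC_BITS = 6    # assumption: INT8 codes carry an implicit 2**-6 scale


def quantize_block(block, symmetric=True):
    """Quantize one block to INT8 codes that share a power-of-two step."""
    qmax = 127
    # Symmetric clipping drops the -128 code so the range is [-127, 127].
    qmin = -127 if symmetric else -128
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0] * len(block), 1.0
    # Assumed scale rule: shared exponent is floor(log2(amax)).
    shared = 2.0 ** math.floor(math.log2(amax))
    step = shared * 2.0 ** -FRAC_BITS
    codes = [min(qmax, max(qmin, round(x / step))) for x in block]
    return codes, step


def dequantize_block(codes, step):
    """Map INT8 codes back to real values."""
    return [c * step for c in codes]


# A block whose extremes are equal in magnitude and opposite in sign.
block = [-1.999, 1.999] + [0.0] * (BLOCK_SIZE - 2)
codes_sym, step = quantize_block(block, symmetric=True)
codes_asym, _ = quantize_block(block, symmetric=False)
print(codes_sym[:2])   # [-127, 127]
print(codes_asym[:2])  # [-128, 127]
```

In this sketch the asymmetric range lets the negative extreme reach -128 while the positive extreme saturates at 127, so the dequantized block sums to a slightly negative value even though the input sums to zero. Restricting the range to [-127, 127] removes that offset, which is the kind of bias the abstract attributes to fine-grained low-bit INT training.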

Related papers

ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs [4.431548809730958]
ARCQuant is a framework that boosts NVFP4 performance via Augmented Residual Channels. We show that ARCQuant achieves state-of-the-art accuracy, comparable to full-precision baselines in perplexity and downstream tasks.
arXiv Detail & Related papers (2026-01-12T12:27:22Z)
Defeating the Training-Inference Mismatch via FP16 [72.25890308541334]
Reinforcement learning (RL) fine-tuning often suffers from instability due to the numerical mismatch between the training and inference policies. We show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference.
arXiv Detail & Related papers (2025-10-30T17:58:11Z)
Speeding Up MACE: Low-Precision Tricks for Equivarient Force Fields [51.95157731126864]
Machine-learning force fields can deliver accurate molecular dynamics (MD) at high computational cost. This thesis aims to make MACE cheaper and faster by identifying computational bottlenecks and evaluating low-precision execution policies.
arXiv Detail & Related papers (2025-10-23T14:02:34Z)
Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization [77.67818998672516]
We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization. We introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm. We show that MR-GPTQ matches or outperforms state-of-the-art accuracy.
arXiv Detail & Related papers (2025-09-27T09:22:21Z)
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. We conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models [17.055400141733124]
Low-bit integer formats (e.g., INT8/INT4) have been the conventional choice for large language models (LLMs). Low-bit floating-point formats (e.g., FP8/FP4) offer a compelling alternative and are gaining support from cutting-edge hardware, such as NVIDIA's H100 GPU. We propose the Mixture of Formats Quantization (MoFQ), which selects the optimal format on a layer-wise basis.
arXiv Detail & Related papers (2023-05-21T05:28:37Z)
FP8 versus INT8 for efficient deep learning inference [14.98281493168929]
We compare the performance for both the FP8 and INT formats for efficient on-device inference. We show that the FP formats are somewhere between 50-180% less efficient in terms of compute in dedicated hardware than the INT format. We conclude that although the proposed FP8 format could be good for training, the results for inference do not warrant a dedicated implementation of FP8.
arXiv Detail & Related papers (2023-03-31T10:29:17Z)
FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings. E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
arXiv Detail & Related papers (2022-09-12T17:39:55Z)
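
The E4M3 encoding described in the last entry can be made concrete with a small decoder. This is a sketch under stated assumptions, not code from any of the listed papers: 1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, and NaN only when the exponent and mantissa bits are all ones. The function name `decode_e4m3` is hypothetical.

```python
import math


def decode_e4m3(byte):
    """Decode one E4M3 byte (assumed layout: S.EEEE.MMM, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan  # the only NaN mantissa pattern; no infinities
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal values
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


print(decode_e4m3(0x7E))  # 448.0, the largest finite value
print(decode_e4m3(0x77))  # 240.0, the largest value if exponent 15 were reserved
```

Because the top exponent code is not reserved for infinities and NaNs, the largest finite value under these assumptions is 448 rather than 240, which is the dynamic-range extension the entry refers to.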

This list is automatically generated from the titles and abstracts of the papers in this site.