Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
- URL: http://arxiv.org/abs/2509.23202v2
- Date: Thu, 16 Oct 2025 09:26:09 GMT
- Title: Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
- Authors: Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
- Abstract summary: We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization. We introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm. We show that MR-GPTQ matches or outperforms state-of-the-art accuracy.
- Score: 77.67818998672516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4, due to two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties, by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by fusing rotations into the weights and computing the activation rotations quickly online. This leads to speedups vs. FP16 of up to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4, to the point where it nears the accuracy of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
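The two format properties named in the abstract (group size and power-of-two scales) and the block-wise Hadamard rotation can be illustrated with a small sketch. This is not the paper's MR-GPTQ implementation; it is a minimal pure-Python illustration under stated assumptions: values are "fake-quantized" to the FP4 (E2M1) magnitude grid, the MXFP4-style scale is rounded up to a power of two (real hardware derives the exponent differently and may clip), and the NVFP4-style scale is kept as an unquantized real number rather than an FP8 value. All function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_fp4(v):
    """Round one value to the nearest FP4 magnitude, keeping its sign."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(v) - g))
    return math.copysign(mag, v)

def quantize(x, group_size, pow2_scale):
    """Fake-quantize x group by group and return the dequantized values.

    pow2_scale=True mimics an MXFP4-style power-of-two group scale
    (assumption: rounded up so the group maximum never clips).
    pow2_scale=False keeps the real-valued scale, a simplification of
    an NVFP4-style FP8 scale.
    """
    out = []
    for i in range(0, len(x), group_size):
        group = x[i:i + group_size]
        absmax = max(abs(v) for v in group)
        if absmax == 0.0:
            out.extend([0.0] * len(group))
            continue
        scale = absmax / FP4_GRID[-1]
        if pow2_scale:
            scale = 2.0 ** math.ceil(math.log2(scale))
        out.extend(round_to_fp4(v / scale) * scale for v in group)
    return out

def hadamard_rotate(block):
    """Orthonormal fast Walsh-Hadamard transform; len(block) must be a power of two."""
    y = list(block)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                y[j], y[j + h] = y[j] + y[j + h], y[j] - y[j + h]
        h *= 2
    return [v / math.sqrt(n) for v in y]

def rotate_blocks(x, block_size):
    """Apply the same Hadamard rotation independently to each contiguous block."""
    out = []
    for i in range(0, len(x), block_size):
        out.extend(hadamard_rotate(x[i:i + block_size]))
    return out

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

if __name__ == "__main__":
    # One outlier in an otherwise small group of 32 values.
    x = [0.1] * 31 + [8.0]
    plain = quantize(x, 32, pow2_scale=True)
    # Rotate, quantize, rotate back: the normalized transform is its own inverse.
    rotated = rotate_blocks(quantize(rotate_blocks(x, 32), 32, pow2_scale=True), 32)
    print("MSE without rotation:", mse(x, plain))
    print("MSE with block rotation:", mse(x, rotated))
```

Because the normalized Hadamard matrix is symmetric and orthogonal, applying `rotate_blocks` twice with the same block size recovers the input, which is why the rotation can be undone (or folded into adjacent weights) without changing the layer's output in exact arithmetic.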
Related papers
- Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation [40.140261007984215]
We improve the state of the art for quantized training in NVFP4 via a novel unbiased quantization routine for micro-scaled formats. We show analytically that Quartet II achieves consistently better gradient estimation across all major matrix multiplications.
arXiv Detail & Related papers (2026-01-30T10:39:11Z)
- ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs [4.431548809730958]
ARCQuant is a framework that boosts NVFP4 performance via Augmented Residual Channels. We show that ARCQuant achieves state-of-the-art accuracy, comparable to full-precision baselines in perplexity and downstream tasks.
arXiv Detail & Related papers (2026-01-12T12:27:22Z)
- Block Rotation is All You Need for MXFP4 Quantization [42.603238130671166]
Post-training quantization is a promising solution for efficient deployment of large language models. While most existing methods are designed for INT4 formats, the emergence of MXFP4 raises questions about the applicability of current techniques. We find that methods like GPTQ consistently deliver strong performance, whereas rotation-based approaches, which are used by almost all state-of-the-art approaches, suffer from severe incompatibility with MXFP4.
arXiv Detail & Related papers (2025-11-06T09:22:31Z)
- INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats [51.72056104795248]
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats. This paper systematically investigates the trade-offs between FP and integer (INT) formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced.
arXiv Detail & Related papers (2025-10-29T15:11:53Z)
- A Comprehensive Evaluation on Quantization Techniques for Large Language Models [46.75040730001041]
Post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead for large language models (LLMs). We conduct an extensive review of state-of-the-art methods and perform comprehensive evaluations under the same conditions for fair comparison. We also evaluate the latest MXFP4 and NVFP4 data formats and their performance.
arXiv Detail & Related papers (2025-07-23T11:21:21Z)
- FP4 All the Way: Fully Quantized Training of LLMs [26.195547788434908]
We demonstrate fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision. We investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods.
arXiv Detail & Related papers (2025-05-25T12:14:25Z)
- Oscillation-Reduced MXFP4 Training for Vision Transformers [19.642508885867375]
Pre-training Transformers in FP4 precision comes with a considerable loss of accuracy. Training with MXFP4 data format still results in significant degradation. We propose a novel training method TetraJet for a more accurate FP4 training.
arXiv Detail & Related papers (2025-02-28T08:51:55Z)
- Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z)
- AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference [6.699442219974261]
AMXFP4 is a 4-bit asymmetric FP format that handles both issues using asymmetric shared scales. AMXFP4 outperforms MXFP4 by 3% on VQA and exceeds rotation-based methods by 1.6% on CSQA.
arXiv Detail & Related papers (2024-11-15T03:11:19Z)
- AffineQuant: Affine Transformation Quantization for Large Language Models [58.45460102764]
Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its compression efficiency and cost-effectiveness in the context of training.
Existing PTQ methods for Large-scale Language Models (LLMs) limit the optimization scope to scaling transformations between pre- and post-quantization weights.
In this paper, we advocate for the direct optimization using equivalent Affine transformations in PTQ (AffineQuant).
arXiv Detail & Related papers (2024-03-19T08:40:21Z)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z)