Block Rotation is All You Need for MXFP4 Quantization
- URL: http://arxiv.org/abs/2511.04214v1
- Date: Thu, 06 Nov 2025 09:22:31 GMT
- Title: Block Rotation is All You Need for MXFP4 Quantization
- Authors: Yuantian Shao, Peisong Wang, Yuanteng Chen, Chang Xu, Zhihui Wei, Jian Cheng
- Abstract summary: Post-training quantization is a promising solution for efficient deployment of large language models. While most existing methods are designed for INT4 formats, the emergence of MXFP4 raises questions about the applicability of current techniques. We find that methods like GPTQ consistently deliver strong performance, whereas rotation-based approaches, used by almost all state-of-the-art methods, suffer from severe incompatibility with MXFP4.
- Score: 42.603238130671166
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have achieved remarkable success, but their rapidly growing scale imposes prohibitive costs in memory, computation, and energy. Post-training quantization (PTQ) is a promising solution for efficient deployment, yet achieving accurate W4A4 quantization remains an open challenge. While most existing methods are designed for INT4 formats, the emergence of MXFP4 -- a new FP4 format with native hardware support from NVIDIA, AMD, and Intel -- raises questions about the applicability of current techniques. In this work, we establish a comprehensive benchmark of PTQ methods under the MXFP4 format. Through systematic evaluation, we find that methods like GPTQ consistently deliver strong performance, whereas rotation-based approaches, which are used by almost all state-of-the-art methods, suffer from severe incompatibility with MXFP4. We further provide the first in-depth analysis of this conflict, tracing its root to a fundamental mismatch between MXFP4's PoT (power-of-two) block scaling and the redistribution of outlier energy via global rotation. Building on this insight, we propose a simple yet effective block rotation strategy that adapts rotation-based methods to MXFP4, leading to substantial accuracy improvements across diverse LLMs. Our findings not only offer clear guidance for practitioners but also set a foundation for advancing PTQ research under emerging low-precision formats.
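The abstract's two key ingredients are MXFP4's power-of-two (PoT) block scaling and a rotation applied per quantization block instead of globally. The NumPy sketch below is not the authors' implementation; it illustrates both ideas under the OCP Microscaling convention (32-element blocks, FP4 E2M1 elements with maximum magnitude 6.0, a shared PoT scale per block), and the scale rule, Hadamard choice, and toy comparison are assumptions made for illustration.

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1 element (per the OCP Microscaling spec).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # MXFP4 block size

def quantize_mxfp4_block(x):
    """Quantize one 1-D block: shared power-of-two scale + nearest E2M1 magnitude."""
    amax = np.abs(x).max()
    if amax == 0.0:
        return np.zeros_like(x)
    # Shared scale is a power of two; E2M1's largest exponent is 2 (6.0 = 1.5 * 2^2).
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = x / scale
    # Round each element to the nearest representable magnitude (saturates at 6.0).
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_E2M1_GRID), axis=1)
    return np.sign(scaled) * FP4_E2M1_GRID[idx] * scale

def quantize_mxfp4(w):
    """Apply MXFP4 block quantization along the last axis of a 2-D matrix."""
    out = np.empty_like(w, dtype=np.float64)
    for i in range(w.shape[0]):
        for j in range(0, w.shape[1], BLOCK):
            out[i, j:j + BLOCK] = quantize_mxfp4_block(w[i, j:j + BLOCK])
    return out

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def block_rotate(w, block=BLOCK):
    """Rotate each quantization block by its own orthogonal Hadamard transform,
    so outlier energy is smoothed within a block rather than spread globally."""
    assert w.shape[1] % block == 0
    h = hadamard(block)
    out = w.copy()
    for j in range(0, w.shape[1], block):
        out[:, j:j + block] = out[:, j:j + block] @ h
    return out

# Toy demo (illustrative only): reconstruction error with and without the per-block
# rotation; the rotation is orthogonal, so error norms in the rotated and original
# spaces are directly comparable.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 256))
w[:, 7] *= 30.0  # inject an outlier channel

err_plain = np.linalg.norm(w - quantize_mxfp4(w))
w_rot = block_rotate(w)
err_rot = np.linalg.norm(w_rot - quantize_mxfp4(w_rot))
print(f"MXFP4 error, no rotation:    {err_plain:.3f}")
print(f"MXFP4 error, block rotation: {err_rot:.3f}")
```

In a real pipeline, rotation-based PTQ methods fold the inverse rotation into the adjacent linear layer so the transform adds no inference cost; the sketch only shows how aligning the rotation with the 32-element scaling blocks differs from applying one global rotation.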
Related papers
- Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats [23.57507112139113]
Microscaling Floating-Point (MXFP) has emerged as a promising low-precision format for large language models (LLMs). Despite various post-training quantization (PTQ) algorithms being proposed, they mostly focus on integer quantization. This work conducts a systematic investigation of PTQ under MXFP formats, encompassing over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families.
arXiv Detail & Related papers (2026-01-14T15:16:55Z) - ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs [4.431548809730958]
ARCQuant is a framework that boosts NVFP4 performance via Augmented Residual Channels. We show that ARCQuant achieves state-of-the-art accuracy, comparable to full-precision baselines in perplexity and downstream tasks.
arXiv Detail & Related papers (2026-01-12T12:27:22Z) - INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats [51.72056104795248]
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats. This paper systematically investigates the trade-offs between FP and integer (INT) formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced.
arXiv Detail & Related papers (2025-10-29T15:11:53Z) - Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization [77.67818998672516]
We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization. We introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm. We show that MR-GPTQ matches or exceeds state-of-the-art accuracy.
arXiv Detail & Related papers (2025-09-27T09:22:21Z) - A Comprehensive Evaluation on Quantization Techniques for Large Language Models [46.75040730001041]
Post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead for large language models (LLMs). We conduct an extensive review of state-of-the-art methods and perform comprehensive evaluations under the same conditions for fair comparison. We also evaluate the latest MXFP4 and NVFP4 data formats and their performance.
arXiv Detail & Related papers (2025-07-23T11:21:21Z) - FP4 All the Way: Fully Quantized Training of LLMs [26.195547788434908]
We demonstrate fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision. We investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods.
arXiv Detail & Related papers (2025-05-25T12:14:25Z) - Quartet: Native FP4 Training Can Be Optimal for Large Language Models [27.800012997794987]
Training large language models (LLMs) directly in low precision offers a way to address computational costs. NVIDIA's recent Blackwell architecture facilitates very low-precision operations using FP4 variants. We introduce a new approach for accurate, end-to-end FP4 training with all the major computations in low precision.
arXiv Detail & Related papers (2025-05-20T17:55:50Z) - Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z) - AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference [6.699442219974261]
AMXFP4 is a 4-bit asymmetric FP format that handles both issues using asymmetric shared scales. AMXFP4 outperforms MXFP4 by 3% on VQA and exceeds rotation-based methods by 1.6% on CSQA.
arXiv Detail & Related papers (2024-11-15T03:11:19Z) - AffineQuant: Affine Transformation Quantization for Large Language Models [58.45460102764]
Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its compression efficiency and cost-effectiveness in the context of training.
Existing PTQ methods for Large-scale Language Models (LLMs) limit the optimization scope to scaling transformations between pre- and post-quantization weights.
In this paper, we advocate direct optimization using equivalent affine transformations in PTQ (AffineQuant).
arXiv Detail & Related papers (2024-03-19T08:40:21Z)