Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats
- URL: http://arxiv.org/abs/2601.09555v1
- Date: Wed, 14 Jan 2026 15:16:55 GMT
- Title: Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats
- Authors: Manyi Zhang, Ji-Fu Li, Zhongao Sun, Haoli Bai, Hui-Ling Zhen, Zhenhua Dong, Xianzhi Yu
- Abstract summary: Microscaling Floating-Point (MXFP) has emerged as a promising low-precision format for large language models (LLMs). Despite various post-training quantization (PTQ) algorithms being proposed, they mostly focus on integer quantization. This work conducts a systematic investigation of PTQ under MXFP formats, encompassing over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families.
- Score: 23.57507112139113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Microscaling Floating-Point (MXFP) has emerged as a promising low-precision format for large language models (LLMs). Despite various post-training quantization (PTQ) algorithms being proposed, they mostly focus on integer quantization, while their applicability and behavior under MXFP formats remain largely unexplored. To address this gap, this work conducts a systematic investigation of PTQ under MXFP formats, encompassing over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families. The key findings include: 1) MXFP8 consistently achieves near-lossless performance, while MXFP4 introduces substantial accuracy degradation and remains challenging; 2) PTQ effectiveness under MXFP depends strongly on format compatibility, with some algorithmic paradigms being consistently more effective than others; 3) PTQ performance exhibits highly consistent trends across model families and modalities, in particular, quantization sensitivity is dominated by the language model rather than the vision encoder in multimodal LLMs; 4) The scaling factor of quantization is a critical error source in MXFP4, and a simple pre-scale optimization strategy can significantly mitigate its impact. Together, these results provide practical guidance on adapting existing PTQ methods to MXFP quantization.
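Finding 4 above attributes much of MXFP4's error to the shared block scale and reports that a simple pre-scale optimization mitigates it. As a rough illustration, the NumPy sketch below (our own construction, not the paper's code) quantizes a 32-element block to FP4 E2M1 values under one shared power-of-two scale, and optionally searches nearby exponents for the one minimizing squared error; this search merely stands in for whatever pre-scale strategy the paper actually uses.

```python
# Toy MXFP4 block quantizer: 32 elements share one power-of-two scale,
# elements are rounded to the nearest signed E2M1 value. Illustrative only.
import numpy as np

# Representable E2M1 magnitudes (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x):
    """Round each element to the nearest signed E2M1 value (clamps at +/-6)."""
    signed = np.sign(x)[..., None] * FP4_GRID      # candidate values per element
    idx = np.abs(x[..., None] - signed).argmin(-1)
    return np.sign(x) * FP4_GRID[idx]

def mxfp4_quantize(block, search=True):
    """Quantize a block with one shared power-of-two scale.

    The default exponent maps the block absmax near the E2M1 max (6 = 1.5*2^2),
    as in the OCP MX convention; `search=True` also tries nearby exponents and
    keeps the one with the lowest squared error.
    """
    amax = np.abs(block).max()
    if amax == 0.0:
        return block.copy()
    e0 = int(np.floor(np.log2(amax))) - 2
    candidates = range(e0 - 2, e0 + 2) if search else [e0]
    best, best_err = None, np.inf
    for e in candidates:
        s = 2.0 ** e
        deq = quantize_fp4(block / s) * s          # quantize-dequantize
        err = float(np.square(deq - block).sum())
        if err < best_err:
            best, best_err = deq, err
    return best

rng = np.random.default_rng(0)
w = rng.standard_normal(32)
for search in (False, True):
    mse = np.mean((mxfp4_quantize(w, search) - w) ** 2)
    print(f"scale search={search}:  block MSE={mse:.5f}")
```

Because the default exponent is always in the candidate set, the search can only match or improve on it, which is consistent with the abstract's claim that the scaling factor is a dominant error source in MXFP4.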
Related papers
- Block Rotation is All You Need for MXFP4 Quantization [42.603238130671166]
Post-training quantization is a promising solution for efficient deployment of large language models. While most existing methods are designed for INT4 formats, the emergence of MXFP4 raises questions about the applicability of current techniques. We find that methods like GPTQ consistently deliver strong performance, whereas rotation-based approaches, which are used by almost all state-of-the-art methods, suffer from severe incompatibility with MXFP4.
arXiv Detail & Related papers (2025-11-06T09:22:31Z) - INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats [51.72056104795248]
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats. This paper systematically investigates the trade-offs between FP and integer (INT) formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced.
arXiv Detail & Related papers (2025-10-29T15:11:53Z) - Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization [77.67818998672516]
We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization. We introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm. We show that MR-GPTQ matches or outperforms state-of-the-art accuracy.
arXiv Detail & Related papers (2025-09-27T09:22:21Z) - A Comprehensive Evaluation on Quantization Techniques for Large Language Models [46.75040730001041]
Post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead for large language models (LLMs). We conduct an extensive review of state-of-the-art methods and perform comprehensive evaluations under the same conditions for fair comparison. We also evaluate the latest MXFP4 and NVFP4 data formats and their performance.
arXiv Detail & Related papers (2025-07-23T11:21:21Z) - FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation [55.12070409045766]
Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years. Current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization.
arXiv Detail & Related papers (2025-06-13T07:57:38Z) - Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z) - Post Training Quantization of Large Language Models with Microscaling Formats [4.736634198230005]
We study the combined application of three well-known post-training techniques, SmoothQuant, AWQ, and GPTQ.
We show that combining different PTQ methods enables us to quantize models to 4-bit weights and 8-bit activations using the MXINT format with negligible accuracy loss.
arXiv Detail & Related papers (2024-05-12T02:15:26Z) - LLM-FP4: 4-Bit Floating-Point Quantized Transformers [38.23587031169402]
We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values.
Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions.
Our method, for the first time, can quantize both weights and activations in the LLaMA-13B to only 4-bit and achieves an average score of 63.1.
arXiv Detail & Related papers (2023-10-25T17:59:32Z) - Benchmarking the Reliability of Post-training Quantization: a Particular Focus on Worst-case Performance [53.45700148820669]
Post-training quantization (PTQ) is a popular method for compressing deep neural networks (DNNs) without modifying their original architecture or training procedures.
Despite its effectiveness and convenience, the reliability of PTQ methods in the presence of extreme cases such as distribution shift and data noise remains largely unexplored.
This paper first investigates this problem on various commonly-used PTQ methods.
arXiv Detail & Related papers (2023-03-23T02:55:50Z)
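Several entries above, the INT v.s. FP study in particular, contrast integer and floating-point element grids under fine-grained block scaling. The sketch below is our own toy setup, not any paper's evaluation protocol: it measures round-to-nearest MSE for unit-max INT4 and FP4 E2M1 grids under per-block absmax scaling, where the float scale is a simplification of the power-of-two scale real MX formats use.

```python
# Toy INT4-vs-FP4 element-grid comparison under per-block absmax scaling.
import numpy as np

# Unit-max element grids: symmetric INT4 and signed FP4 (E2M1).
INT4_GRID = np.arange(-7, 8) / 7.0
_fp4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]) / 6.0
FP4_GRID = np.concatenate([-_fp4[:0:-1], _fp4])

def block_quant_mse(x, grid, block=32):
    """Scale each block by its absmax, round to nearest grid value, return MSE."""
    x = x[: x.size // block * block].reshape(-1, block)
    s = np.abs(x).max(axis=1, keepdims=True)
    s[s == 0.0] = 1.0
    idx = np.abs((x / s)[..., None] - grid).argmin(-1)
    return float(np.mean((grid[idx] * s - x) ** 2))

rng = np.random.default_rng(0)
datasets = {"gaussian": rng.standard_normal(4096),
            "heavy-tailed": rng.standard_t(df=3, size=4096)}
for name, data in datasets.items():
    for fmt, grid in [("INT4", INT4_GRID), ("FP4", FP4_GRID)]:
        print(f"{name:12s} {fmt}: MSE = {block_quant_mse(data, grid):.6f}")
```

Running it on Gaussian versus heavier-tailed inputs gives a quick feel for when FP4's denser small-magnitude levels pay off relative to INT4's uniform grid.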