MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving
- URL: http://arxiv.org/abs/2510.14557v1
- Date: Thu, 16 Oct 2025 11:05:54 GMT
- Title: MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving
- Authors: Jungi Lee, Junyong Park, Soohyun Cha, Jaehoon Cho, Jaewoong Sim
- Abstract summary: Reduced-precision data formats are crucial for cost-effective serving of large language models (LLMs). In this paper, we focus on recent industry-driven variants of block floating-point (BFP) formats. We propose MX+, a cost-effective and non-intrusive extension designed for seamless integration into the microscaling (MX) formats.
- Score: 4.176741972965246
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reduced-precision data formats are crucial for cost-effective serving of large language models (LLMs). While numerous reduced-precision formats have been introduced thus far, they often require intrusive modifications to the software frameworks or are rather unconventional for widespread adoption across hardware vendors. In this paper, we instead focus on recent industry-driven variants of block floating-point (BFP) formats and conduct a comprehensive analysis to push their limits for efficient LLM serving. Our analysis shows that existing ultra low-bit BFP variants struggle to provide reasonable language model performance due to outlier values in blocks. To address the outliers with BFPs, we propose MX+, a cost-effective and non-intrusive extension designed for seamless integration into the microscaling (MX) formats. MX+ builds on the key insight that the outlier does not need to use its exponent field in the element data type, which allows us to repurpose the exponent field as an extended mantissa to increase the precision of the outlier element. Our evaluation shows that MX+ achieves significantly higher model performance compared to the 4-bit MX format (MXFP4) with negligible storage overhead and slowdown, thus offering a compelling alternative to MXFP4 or MXFP6 for efficient LLM inference.
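To make the abstract's key insight concrete, below is a minimal NumPy sketch of an MX-style block quantizer in which the single outlier element reuses its exponent bits as extra mantissa bits. The block size (32), FP4 (E2M1) element format, power-of-two shared scale, and the implied top-binade exponent for the outlier are illustrative assumptions based on the standard microscaling direct-cast convention; the actual MX+ encoding is defined in the paper.

```python
import numpy as np

# Toy sketch of the MX+ insight (illustrative assumptions, not the paper's exact encoding):
# the shared scale is derived from the block maximum, so the outlier (largest-magnitude
# element) always lands in the top binade of the element format. Its per-element exponent
# is therefore implied, and those exponent bits can serve as extra mantissa bits.

FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # magnitudes representable in FP4 (E2M1)

def quantize_mxfp4_block(x):
    """Direct-cast a 32-element block: shared power-of-two scale + FP4 (E2M1) elements."""
    amax = np.abs(x).max()
    if amax == 0.0:
        return np.zeros_like(x), 1.0
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)      # emax of E2M1 is 2, so amax/scale lies in [4, 8)
    mags = np.abs(x) / scale
    idx = np.abs(mags[:, None] - FP4_E2M1[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_E2M1[idx] * scale, scale

def quantize_mx_plus_block(x):
    """Same direct cast, but the outlier repurposes its 2 exponent bits as extra mantissa bits."""
    q, scale = quantize_mxfp4_block(x)
    out = int(np.abs(x).argmax())
    # Implied exponent: the outlier sits in [4, 8) after scaling, so a 3-bit mantissa with an
    # implicit leading 1 covers that binade in steps of 0.5 instead of FP4's top-end step of 2.
    extended = 4.0 * (1.0 + np.arange(8) / 8.0)       # {4.0, 4.5, ..., 7.5}
    mag = np.abs(x[out]) / scale
    q[out] = np.sign(x[out]) * extended[np.abs(extended - mag).argmin()] * scale
    return q

block = np.random.randn(32).astype(np.float32)
block[7] *= 8.0                                       # inject an outlier
print("MXFP4 max abs error:", np.abs(quantize_mxfp4_block(block)[0] - block).max())
print("MX+   max abs error:", np.abs(quantize_mx_plus_block(block) - block).max())
```

Under this toy model only the outlier's encoding changes, which is consistent with the abstract's claim of negligible storage overhead: the main extra metadata would be identifying the outlier within each block.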
Related papers
- INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats [51.72056104795248]
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats. This paper systematically investigates the trade-offs between FP and integer (INT) formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced.
arXiv Detail & Related papers (2025-10-29T15:11:53Z)
- MC#: Mixture Compressor for Mixture-of-Experts Large Models [86.64315380917827]
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. We propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning.
arXiv Detail & Related papers (2025-10-13T03:12:46Z)
- Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization [77.67818998672516]
We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization. We introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm. We show that MR-GPTQ matches or outperforms state-of-the-art accuracy.
arXiv Detail & Related papers (2025-09-27T09:22:21Z)
- MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models [3.305409455598179]
Quantization significantly accelerates inference in large language models (LLMs). Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. We propose MicroMix, a co-designed mixed-precision quantization algorithm and matrix multiplication kernel based on Microscaling (MX) data formats.
arXiv Detail & Related papers (2025-08-04T12:22:39Z)
- ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts [79.62448915248926]
Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing accuracy over 16-bit model inference. We propose using MXFP4 models as drafts in a plug-and-play fashion, since MXFP4 Weight-Only Quantization (WOQ) merely direct-casts the BF16 target model weights to MXFP4. In practice, our plug-and-play solution gives speedups up to 2x over the BF16 baseline.
arXiv Detail & Related papers (2025-03-17T08:38:45Z)
- Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models. High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size. We propose sparse gradient compression (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z)
- The Power of Negative Zero: Datatype Customization for Quantized Large Language Models [5.503925076208333]
Post-training quantization serves as one of the most hardware-efficient methods to mitigate the memory and computational demands of large language models (LLMs). In this paper, we extend the basic FP datatype to perform Redundant Zero Remapping (RaZeR). RaZeR remaps the negative-zero FP encoding to a set of pre-defined special values to maximally utilize FP quantization encodings and to better fit numerical distributions (a toy illustration of the remapping idea appears after this list).
arXiv Detail & Related papers (2025-01-06T22:40:40Z)
- Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models [5.680646377552021]
Nanoscaling (NxFP) proposes three techniques to enable better accuracy and a smaller memory footprint than state-of-the-art MxFP. NxFP reduces memory footprint by up to 16% while achieving perplexity comparable to MxFP.
arXiv Detail & Related papers (2024-12-15T22:18:20Z)
- DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization [59.96455188197593]
Large language models (LLMs) deliver impressive results but face challenges from increasing model sizes and computational costs. We propose DRPruning, a method that dynamically adjusts the data distribution during training to restore balanced performance across heterogeneous and multi-tasking data. Experiments in monolingual and multilingual settings show that DRPruning surpasses similarly sized models in both pruning and continued pretraining over perplexity, downstream tasks, and instruction tuning.
arXiv Detail & Related papers (2024-11-21T12:02:39Z)
- AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference [6.699442219974261]
AMXFP4 is a 4-bit asymmetric FP format that handles both issues using asymmetric shared scales. AMXFP4 outperforms MXFP4 by 3% on VQA and exceeds rotation-based methods by 1.6% on CSQA.
arXiv Detail & Related papers (2024-11-15T03:11:19Z)
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
In practice, our method can reduce model size by 43.1% and bring $1.25\sim1.56\times$ wall-clock time speedup on different hardware with negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
- SecFormer: Fast and Accurate Privacy-Preserving Inference for Transformer Models via SMPC [34.63351580241698]
We introduce a comprehensive PPI framework called SecFormer to achieve fast and accurate PPI for Transformer models. In terms of efficiency, SecFormer is 3.57 and 3.58 times faster than PUMA for BERT$_\text{BASE}$ and BERT$_\text{LARGE}$, respectively, demonstrating its effectiveness and speed.
arXiv Detail & Related papers (2024-01-01T15:40:35Z)
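Relatedly, the negative-zero remapping summarized in "The Power of Negative Zero" entry above lends itself to a similarly small sketch: FP4 wastes one of its 16 codes on a redundant -0, and that code can be redirected to an extra useful value. The special value used below (8.0) and the decode-table layout are illustrative assumptions; the paper defines its own set of pre-defined special values.

```python
import numpy as np

# Toy sketch of redundant-zero remapping (RaZeR-style): FP4 (E2M1) has 16 codes but only
# 15 distinct values because +0 and -0 collide. Remapping the -0 code to a special value
# adds one more representable point. The special value 8.0 is an illustrative assumption.

# Decode table indexed by the 4-bit code (sign bit | 2 exponent bits | 1 mantissa bit).
FP4_DECODE = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,           # codes 0..7  (sign = 0)
                       -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])  # codes 8..15 (sign = 1)

RAZER_DECODE = FP4_DECODE.copy()
RAZER_DECODE[8] = 8.0        # code 0b1000 (negative zero) now decodes to a special value

def quantize(x, decode_table):
    """Round each value to the nearest decodable value; return the 4-bit codes."""
    return np.abs(x[:, None] - decode_table[None, :]).argmin(axis=1).astype(np.uint8)

x = np.array([0.1, 5.2, 7.9, -2.8])
for name, table in (("FP4  ", FP4_DECODE), ("RaZeR", RAZER_DECODE)):
    codes = quantize(x, table)
    print(name, table[codes])  # with RaZeR, 7.9 maps to 8.0 instead of clamping to 6.0
```

In practice the special value(s) would be chosen to fit the tensor's numerical distribution, as the entry above describes; the fixed 8.0 here only illustrates the extra code point.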
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.