AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference
- URL: http://arxiv.org/abs/2411.09909v2
- Date: Fri, 30 May 2025 01:11:39 GMT
- Title: AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference
- Authors: Janghwan Lee, Jiwoong Park, Jinseok Kim, Yongjik Kim, Jungju Oh, Jinwook Oh, Jungwook Choi,
- Abstract summary: AMXFP4 is a 4-bit asymmetric FP format that handles both issues using asymmetric shared scales.<n>AMXFP4 outperforms MXFP4 by 3% on VQA and exceeds rotation-based methods by 1.6% on CSQA.
- Score: 6.699442219974261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) grow in parameter size and context length, computation precision has been reduced from 16-bit to 4-bit to improve inference efficiency. However, this reduction causes accuracy degradation due to activation outliers. Rotation-based INT4 methods address this via matrix calibration, but they introduce multi-hour overheads and leave key computations in full precision. Microscaling (MX) floating-point (FP) formats offer fine-grained representation with a shared scale, enabling fully quantized matrix multiplications through direct casting without calibration. However, existing research shows unsatisfactory empirical results for MXFP4 inference, and the robustness of MX formats remains largely unexplored. In this work, we uncover the fundamental tradeoffs of the MX format: while it effectively suppresses activation outliers, it does so at the cost of increased group-wise asymmetry. To address this, we propose AMXFP4, a 4-bit asymmetric FP format that handles both issues using asymmetric shared scales, without requiring calibration. Our custom MAC engine adds negligible hardware cost while improving accuracy: AMXFP4 outperforms MXFP4 by 3% on VQA and exceeds rotation-based methods by 1.6% on CSQA. It also surpasses recently deployed commercial MXFP4 variants. Code: https://github.com/aiha-lab/MX-QLLM
Related papers
- MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models [3.305409455598179]
Quantization significantly accelerates inference in large language models (LLMs)<n>Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format.<n>We propose MicroMix, a co-designed mixed-precision quantization algorithm and matrix multiplication kernel based on Microscaling (MX) data formats.
arXiv Detail & Related papers (2025-08-04T12:22:39Z) - Quartet: Native FP4 Training Can Be Optimal for Large Language Models [27.800012997794987]
Training large language models (LLMs) models directly in low-precision offers a way to address computational costs.<n> NVIDIA's recent Blackwell architecture facilitates very low-precision operations using FP4 variants.<n>We introduce a new approach for accurate, end-to-end FP4 training with all the major computations in low precision.
arXiv Detail & Related papers (2025-05-20T17:55:50Z) - QUAD: Quantization and Parameter-Efficient Tuning of LLM with Activation Decomposition [21.13478769431063]
QUAD (Quantization with Activation Decomposition) is a framework leveraging Singular Value Decomposition (SVD) to suppress activation outliers for effective 4-bit quantization.
We show QUAD achieves 94% 96% accuracy under W4A4 quantization and 98% accuracy with W4A4/A8 and parameter-efficient fine-tuning for Llama-3 and Qwen-2.5 models.
arXiv Detail & Related papers (2025-03-25T05:03:56Z) - ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts [79.62448915248926]
Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing accuracy over the 16-bit model inference.<n>We propose using MXFP4 models as drafts in a plug-and-play fashion since the MXFP4 Weight-Only-Quantization (WOQ) merely direct-casts the BF16 target model weights to MXFP4.<n>In practice, our plug-and-play solution gives speedups up to 2x over the BF16 baseline.
arXiv Detail & Related papers (2025-03-17T08:38:45Z) - MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration [23.752021919501207]
We propose MergeQuant, an accurate and efficient per-channel static quantization framework.
MergeQuant integrates the per-channel quantization steps with the corresponding scalings and linear mappings through a Quantization Step Migration (QSM) method.
On Llama-2-7B model, MergeQuant achieves up to 1.77x speedup in decoding, and up to 2.06x speedup in end-to-end compared to FP16 baseline.
arXiv Detail & Related papers (2025-03-07T04:52:28Z) - Oscillation-Reduced MXFP4 Training for Vision Transformers [19.642508885867375]
Pre-training Transformers in FP4 precision comes with a considerable loss of accuracy.<n>Training with MXFP4 data format still results in significant degradation.<n>We propose a novel training method TetraJet for a more accurate FP4 training.
arXiv Detail & Related papers (2025-02-28T08:51:55Z) - Training LLMs with MXFP4 [15.084813381461903]
We present the first near-lossless training recipe that uses MXFP4 GEMMs, which are $2times$ faster than FP8 on supported hardware.<n>Our recipe computes $>1/2$ the training FLOPs in MXFP4, enabling an estimated speedup of $>1.3times$ over FP8 and $>1.7times$ over BF16 during backpropagation.
arXiv Detail & Related papers (2025-02-27T23:01:31Z) - Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs.
This work introduces the first FP4 training framework for large language models (LLMs)
arXiv Detail & Related papers (2025-01-28T18:04:50Z) - Qrazor: Reliable and Effortless 4-bit LLM Quantization by Significant Data Razoring [2.983583925806601]
We propose QRazor, a simple yet effective quantization scheme that enables 4-bit quantization of weights, activations, and KV cache in transformer-based language models.
QRazor operates in two stages: first, quantizing data using 8 or 16-bit integers as a basis with absolute max scaling to preserve accuracy close to full-precision models, and second, compressing the quantized data to 4-bit using our significant data razoring (SDR) technique.
arXiv Detail & Related papers (2025-01-23T02:20:08Z) - Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference [54.2589824716527]
Large language models incur substantial computation and memory movement costs due to their large scale.
Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation.
We propose Rotated Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Smooth and Rotation operation.
The proposed method outperforms the state-of-the-art method in the LLaMA and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
arXiv Detail & Related papers (2024-09-30T14:59:22Z) - HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization [10.307268005739202]
Diffusion Transformers (DiTs) have recently gained substantial attention for their superior visual generation capabilities.
DiTs also come with high parameter counts and implementation costs, seriously restricting their use on resource-limited devices such as mobile phones.
We introduce the Hybrid Floating-point Quantization for DiT(HQ-DiT), an efficient post-training quantization method that utilizes 4-bit floating-point (FP) precision on both weights and activations for DiT inference.
arXiv Detail & Related papers (2024-05-30T06:56:11Z) - SpinQuant: LLM quantization with learned rotations [49.07335692298487]
Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs)<n>We identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures while enhancing quantization accuracy.<n>We propose SpinQuant, a novel approach that incorporates learned rotation matrices for optimal quantized network accuracy.
arXiv Detail & Related papers (2024-05-26T02:15:49Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs)
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially with below 4 bit-widths.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - AffineQuant: Affine Transformation Quantization for Large Language Models [58.45460102764]
Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its compression efficiency and cost-effectiveness in the context of training.
Existing PTQ methods for Large-scale Language Models (LLMs) limit the optimization scope to scaling transformations between pre- and post-quantization weights.
In this paper, we advocate for the direct optimization using equivalent Affine transformations in PTQ (AffineQuant)
arXiv Detail & Related papers (2024-03-19T08:40:21Z) - FlattenQuant: Breaking Through the Inference Compute-bound for Large
Language Models with Per-tensor Quantization [6.931020818874328]
We introduce a method called FlattenQuant, which significantly reduces the maximum value of the tensor by flattening the large channels in the tensor, to achieve low bit per-tensor quantization with minimal accuracy loss.
Our work achieves up to 2$times$ speedup and 2.3$times$ memory reduction for LLMs with negligible loss in accuracy.
arXiv Detail & Related papers (2024-02-28T02:00:34Z) - DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z) - Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization [12.655230451207956]
This paper focuses on post-training quantization (PTQ) in Large Language Models (LLMs)
We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC)
We demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models.
arXiv Detail & Related papers (2023-11-09T06:19:51Z) - LLM-FP4: 4-Bit Floating-Point Quantized Transformers [38.23587031169402]
We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values.
Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions.
Our method, for the first time, can quantize both weights and activations in the LLaMA-13B to only 4-bit and achieves an average score of 63.1.
arXiv Detail & Related papers (2023-10-25T17:59:32Z) - QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language
Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z) - OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLM.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z) - SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight
Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run 33B parameter LLM on a single 24 GB consumer GPU without any performance degradation at 15% speedup.
arXiv Detail & Related papers (2023-06-05T17:53:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.