ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs
- URL: http://arxiv.org/abs/2601.07475v1
- Date: Mon, 12 Jan 2026 12:27:22 GMT
- Title: ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs
- Authors: Haoqian Meng, Yilun Luo, Yafei Zhao, Wenyuan Liu, Peng Zhang, Xindian Ma,
- Abstract summary: ARCQuant is a framework that boosts NVFP4 performance via Augmented Residual Channels. We show that ARCQuant achieves state-of-the-art accuracy, comparable to full-precision baselines in perplexity and downstream tasks.
- Score: 4.431548809730958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emergence of fine-grained numerical formats like NVFP4 presents new opportunities for efficient Large Language Model (LLM) inference. However, it is difficult to adapt existing Post-Training Quantization (PTQ) strategies to these formats: rotation-based methods compromise fine-grained block isolation; smoothing techniques struggle with significant 4-bit quantization errors; and mixed-precision approaches often conflict with hardware constraints on unified-precision computation. To address these challenges, we propose ARCQuant, a framework that boosts NVFP4 performance via Augmented Residual Channels. Distinct from methods that compromise block isolation or hardware uniformity, ARCQuant maintains a strictly unified NVFP4 format by augmenting the activation matrix with quantized residual channels. This design integrates the error compensation process directly into the matrix reduction dimension, enabling the use of standard, highly optimized GEMM kernels with minimal overhead. Theoretical analysis confirms that the worst-case error bound of our dual-stage NVFP4 quantization is comparable to that of standard 8-bit formats such as MXFP8. Extensive experiments on LLaMA and Qwen models demonstrate that ARCQuant achieves state-of-the-art accuracy, comparable to full-precision baselines in perplexity and downstream tasks. Furthermore, deployment on RTX 5090 and RTX PRO 6000 GPUs confirms practical benefits, achieving up to 3x speedup over FP16. Our code is available at https://github.com/actypedef/ARCQuant .
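As a rough sketch of the mechanism described in the abstract (not the authors' implementation: the helper names, the nearest-value FP4 rounding, and the full-precision per-block scales are assumptions for illustration; real NVFP4 stores an FP8 scale per 16-element block plus a tensor-level scale), the following shows dual-stage quantization in which the quantized residual is appended as extra channels along the reduction dimension and the quantized weight is tiled to match, so one standard GEMM computes both the main product and the compensation term.

```python
import numpy as np

# Representable magnitudes of an E2M1 (FP4) element, as used inside NVFP4 blocks.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blocks(x, block=16):
    """Simplified NVFP4-style fake quantization: one full-precision scale per
    16-element block (block amax mapped to 6) and nearest-value FP4 rounding.
    Returns the dequantized tensor so residuals can be formed; no bit packing."""
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0.0] = 1.0
    q = xb / scale
    idx = np.abs(np.abs(q)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(q) * FP4_GRID[idx] * scale).reshape(x.shape)

def arc_style_gemm(x, w, block=16):
    """Dual-stage quantization with the residual folded into the GEMM.

    x: activations (tokens, in_features); w: weights (in_features, out_features).
    Stage 1 quantizes x; stage 2 quantizes the residual x - q1. Concatenating
    [q1 | q2] along the reduction dimension and stacking the quantized weight
    twice makes a single standard GEMM compute q1 @ w_q + q2 @ w_q."""
    q1 = quantize_fp4_blocks(x, block)
    q2 = quantize_fp4_blocks(x - q1, block)      # augmented residual channels
    x_aug = np.concatenate([q1, q2], axis=1)     # (tokens, 2 * in_features)
    w_q = quantize_fp4_blocks(w.T, block).T      # block weights along in_features
    w_aug = np.concatenate([w_q, w_q], axis=0)   # (2 * in_features, out_features)
    return x_aug @ w_aug                         # one unified-precision GEMM

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((64, 32)).astype(np.float32)
ref = x @ w
err_single = np.abs(quantize_fp4_blocks(x) @ quantize_fp4_blocks(w.T).T - ref).mean()
err_dual = np.abs(arc_style_gemm(x, w) - ref).mean()
print(f"one-stage FP4 GEMM error: {err_single:.4f}, with residual channels: {err_dual:.4f}")
```

Under this reading, the error-bound claim is also intuitive: if a single NVFP4 stage incurs a relative error of very roughly 2^-2, quantizing the residual in a second stage compounds that to about 2^-4, the same range as an 8-bit microscaling format such as MXFP8; this is a back-of-envelope argument, not the paper's analysis. The sketch doubles the reduction dimension for clarity; the abstract's "minimal overhead" suggests the residual channels are added more sparingly in practice.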
Related papers
- BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models [56.504879072674015]
We propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients. BPDQ enables serving Qwen2.5-72B on a single GTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit).
arXiv Detail & Related papers (2026-02-04T02:54:37Z)
- Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling [13.357423392911036]
We introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values. We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform. We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy (a small illustrative sketch of this two-scale selection appears after this list).
arXiv Detail & Related papers (2025-12-01T18:59:45Z)
- INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats [51.72056104795248]
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats. This paper systematically investigates the trade-offs between FP and integer (INT) formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced.
arXiv Detail & Related papers (2025-10-29T15:11:53Z)
- Post-Training Quantization via Residual Truncation and Zero Suppression for Diffusion Models [10.000323762676633]
Diffusion models achieve high-quality image generation but face deployment challenges due to their high computational requirements. We propose Quantization via Residual Truncation and Zero Suppression (QuaRTZ), a 4-bit PTQ scheme for diffusion models. Our approach reduces rounding errors and improves quantization efficiency by balancing outlier preservation and LSB precision.
arXiv Detail & Related papers (2025-09-30T15:55:42Z)
- Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization [77.67818998672516]
We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization. We introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm. We show that MR-GPTQ matches or outperforms state-of-the-art accuracy.
arXiv Detail & Related papers (2025-09-27T09:22:21Z)
- PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
- LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving [16.392977892243895]
LiquidGEMM is a hardware-efficient quantization method for 4-bit weight and 8-bit activation quantization. LiquidGEMM achieves up to 2.90x speedup over state-of-the-art W4A8 kernels and up to 4.94x end-to-end system-level speedup.
arXiv Detail & Related papers (2025-09-01T08:16:20Z)
- Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z)
- TCAQ-DM: Timestep-Channel Adaptive Quantization for Diffusion Models [49.65286242048452]
We propose a novel method dubbed Timestep-Channel Adaptive Quantization for Diffusion Models (TCAQ-DM). The proposed method substantially outperforms the state-of-the-art approaches in most cases.
arXiv Detail & Related papers (2024-12-21T16:57:54Z)
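For the "Four Over Six" entry above, here is one plausible reading of the per-block scale selection it describes. The candidate pair (map the block maximum to 6, the largest FP4 magnitude, or to 4) and the squared-error selection rule are guesses from the title and blurb, not the authors' exact procedure, and the function names are invented for this sketch.

```python
import numpy as np

# FP4 (E2M1) magnitudes available inside an NVFP4 block.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_round(q):
    """Round each value to the nearest representable FP4 magnitude, keeping the sign."""
    idx = np.abs(np.abs(q)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(q) * FP4_GRID[idx]

def quantize_block_two_scales(block_vals):
    """Try two candidate scales for one block -- mapping the block maximum to 6
    (the usual choice) or to 4 -- and keep whichever gives lower squared error."""
    amax = np.abs(block_vals).max()
    if amax == 0.0:
        return block_vals.copy()
    best, best_err = None, np.inf
    for target in (6.0, 4.0):                        # the "four" and the "six"
        scale = amax / target
        deq = fp4_round(block_vals / scale) * scale  # fake-quantized block
        err = np.square(deq - block_vals).sum()
        if err < best_err:
            best, best_err = deq, err
    return best

rng = np.random.default_rng(1)
blk = rng.standard_normal(16).astype(np.float32)
print(quantize_block_two_scales(blk))
```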