Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation
- URL: http://arxiv.org/abs/2601.22813v1
- Date: Fri, 30 Jan 2026 10:39:11 GMT
- Title: Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation
- Authors: Andrei Panferov, Erik Schultheis, Soroush Tabesh, Dan Alistarh
- Abstract summary: We improve the state of the art for quantized training in NVFP4 via a novel unbiased quantization routine for micro-scaled formats. We show analytically that Quartet II achieves consistently better gradient estimation across all major matrix multiplications.
- Score: 40.140261007984215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The NVFP4 lower-precision format, supported in hardware by NVIDIA Blackwell GPUs, promises to allow, for the first time, end-to-end fully-quantized pre-training of massive models such as LLMs. Yet, existing quantized training methods still sacrifice some of the representation capacity of this format in favor of more accurate unbiased quantized gradient estimation by stochastic rounding (SR), losing noticeable accuracy relative to standard FP16 and FP8 training. In this paper, we improve the state of the art for quantized training in NVFP4 via a novel unbiased quantization routine for micro-scaled formats, called MS-EDEN, that has more than 2x lower quantization error than SR. We integrate it into a novel fully-NVFP4 quantization scheme for linear layers, called Quartet II. We show analytically that Quartet II achieves consistently better gradient estimation across all major matrix multiplications, both on the forward and on the backward passes. In addition, our proposal synergizes well with recent training improvements aimed specifically at NVFP4. We further validate Quartet II on end-to-end LLM training with up to 1.9B parameters on 38B tokens. We provide kernels for execution on NVIDIA Blackwell GPUs with up to 4.2x speedup over BF16. Our code is available at https://github.com/IST-DASLab/Quartet-II.
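The abstract contrasts MS-EDEN with plain stochastic rounding (SR). As a point of reference, the following is a minimal, illustrative sketch of SR applied to a block-scaled FP4 (E2M1) grid, the element format underlying NVFP4. This is generic stochastic rounding, not the paper's MS-EDEN routine; the block size of 16 and the simple max-to-6.0 scaling convention are simplifying assumptions, and the function name is hypothetical.

```python
import random

# Positive magnitudes representable in FP4 (E2M1), the element format of NVFP4.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def stochastic_round_fp4(xs, block=16, rng=random):
    """Quantize a list of floats to block-scaled FP4 with stochastic rounding.

    Each block of `block` values shares one scale so that its largest
    magnitude maps to the top FP4 value (6.0). Each scaled value is then
    rounded to one of its two neighbouring grid points with probabilities
    chosen so that the rounding is unbiased: E[quantize(x)] == x.
    """
    out = []
    for start in range(0, len(xs), block):
        blk = xs[start:start + block]
        amax = max(abs(v) for v in blk)
        if amax == 0.0:
            out.extend([0.0] * len(blk))
            continue
        scale = amax / FP4_GRID[-1]
        for v in blk:
            # Scaled magnitude, clamped to the grid range to guard against
            # floating-point overshoot past 6.0.
            s = min(abs(v) / scale, FP4_GRID[-1])
            hi = max(next(i for i, g in enumerate(FP4_GRID) if g >= s), 1)
            g_lo, g_hi = FP4_GRID[hi - 1], FP4_GRID[hi]
            # Round up with probability (s - g_lo) / (g_hi - g_lo), which
            # makes the rounding unbiased in expectation.
            p_up = (s - g_lo) / (g_hi - g_lo)
            q = g_hi if rng.random() < p_up else g_lo
            out.append((1.0 if v >= 0 else -1.0) * q * scale)
    return out
```

Averaging many independent draws recovers the input in expectation, which is the unbiasedness property SR provides for gradient estimation; the paper's claim is that MS-EDEN retains this unbiasedness while cutting quantization error by more than 2x.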
Related papers
- Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling [13.357423392911036]
We introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values. We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform. We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy.
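The two-candidate-scale idea described above can be sketched as follows, under the assumption (suggested by the name "Four Over Six" but not spelled out in the abstract) that the two candidates map a block's maximum magnitude to the FP4 values 6.0 and 4.0 respectively, keeping whichever yields lower squared error; the function names are hypothetical.

```python
# Positive magnitudes representable in FP4 (E2M1).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def nearest_fp4(s):
    """Round a non-negative scaled value to the nearest FP4 magnitude."""
    return min(FP4_GRID, key=lambda g: abs(g - s))

def quantize_block_4over6(blk):
    """Quantize one block, trying a scale that maps its max to 6.0 and one
    that maps it to 4.0, and keeping whichever gives lower squared error."""
    amax = max(abs(v) for v in blk)
    if amax == 0.0:
        return list(blk)
    best, best_err = None, float("inf")
    for top in (6.0, 4.0):  # the two candidate scale targets
        scale = amax / top
        q = [(1 if v >= 0 else -1) * nearest_fp4(abs(v) / scale) * scale
             for v in blk]
        err = sum((a - b) ** 2 for a, b in zip(blk, q))
        if err < best_err:
            best, best_err = q, err
    return best
```

By construction, the two-candidate search can never do worse than always scaling the block maximum to 6.0, since that option remains one of the candidates.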
arXiv Detail & Related papers (2025-12-01T18:59:45Z) - CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training [73.46600457802693]
We introduce a new method that counteracts the loss induced by quantization. CAGE significantly improves upon the state-of-the-art methods in terms of accuracy, for similar computational cost. For QAT pre-training of Llama models, CAGE matches the accuracy achieved at 4-bits (W4A4) with the prior best method.
arXiv Detail & Related papers (2025-10-21T16:33:57Z) - Pretraining Large Language Models with NVFP4 [53.235038214986865]
We introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates a two-dimensional quantization scheme for consistent representations across both the forward and backward passes. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline.
arXiv Detail & Related papers (2025-09-29T17:53:17Z) - Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization [77.67818998672516]
We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization. We introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm. We show that MR-GPTQ matches or outperforms state-of-the-art accuracy.
arXiv Detail & Related papers (2025-09-27T09:22:21Z) - Unifying Block-wise PTQ and Distillation-based QAT for Progressive Quantization toward 2-bit Instruction-Tuned LLMs [12.050611514060023]
We propose Unified Progressive Quantization (UPQ) that unifies block-wise post-training quantization with distillation-based quantization-aware training (Distill-QAT) for instruction-tuned LLM quantization. To the best of our knowledge, we are the first to demonstrate that UPQ can quantize open-source instruction-tuned LLMs to INT2 without relying on proprietary post-training data.
arXiv Detail & Related papers (2025-06-10T16:26:32Z) - FP4 All the Way: Fully Quantized Training of LLMs [26.195547788434908]
We demonstrate fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision. We investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods.
arXiv Detail & Related papers (2025-05-25T12:14:25Z) - Quartet: Native FP4 Training Can Be Optimal for Large Language Models [27.800012997794987]
Training large language models (LLMs) directly in low-precision offers a way to address computational costs. NVIDIA's recent Blackwell architecture facilitates very low-precision operations using FP4 variants. We introduce a new approach for accurate, end-to-end FP4 training with all the major computations in low precision.
arXiv Detail & Related papers (2025-05-20T17:55:50Z) - Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z) - QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.