TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control
- URL: http://arxiv.org/abs/2510.27527v1
- Date: Fri, 31 Oct 2025 14:57:16 GMT
- Title: TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control
- Authors: Yuxiang Chen, Xiaoming Xu, Pengle Zhang, Michael Beyer, Martin Rapp, Jun Zhu, Jianfei Chen
- Abstract summary: Training Large Language Models (LLMs) is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers.
- Score: 24.897675627585798
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training Large Language Models (LLMs) is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers, 2) OsciReset, an algorithm to suppress weight oscillation, and 3) OutControl, an algorithm to retain outlier accuracy. TetraJet-v2 consistently outperforms prior FP4 training methods on pre-training LLMs across varying model sizes up to 370M and data sizes up to 200B tokens, reducing the performance gap to full-precision training by an average of 51.3%.
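The abstract does not spell out how NVFP4 quantization is applied inside the linear layers, so the following is a minimal sketch, assuming "double-block" refers to two levels of scaling (a per-tensor FP32 scale on top of per-block scales, as in NVIDIA's NVFP4 format with 16-element blocks). The block size, rounding rule, and function names are illustrative assumptions, not TetraJet-v2's actual algorithm.

```python
import numpy as np

# Representable magnitudes of the FP4 (E2M1) format used by NVFP4.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E4M3_MAX = 448.0  # largest normal value of the FP8 (E4M3) block-scale format


def quantize_nvfp4(x, block_size=16):
    """Simplified two-level (per-tensor + per-block) NVFP4 fake quantization.

    Illustrative sketch only: the tensor size must be divisible by
    block_size, and block scales are kept in FP32 here instead of FP8
    for clarity.
    """
    x = np.asarray(x, dtype=np.float32)
    flat = x.reshape(-1, block_size)

    # Level 1: per-tensor FP32 scale, chosen so per-block scales fit in FP8.
    tensor_scale = np.abs(x).max() / (FP4_GRID[-1] * E4M3_MAX) + 1e-12

    # Level 2: per-block scale mapping each block's max magnitude onto 6.0.
    block_amax = np.abs(flat).max(axis=1, keepdims=True)
    block_scale = block_amax / (FP4_GRID[-1] * tensor_scale) + 1e-12

    # Round every element to the nearest representable FP4 magnitude.
    scaled = flat / (block_scale * tensor_scale)
    sign = np.sign(scaled)
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = sign * FP4_GRID[idx]

    # Dequantized values, as they would leave a 4-bit matmul's epilogue.
    return (q * block_scale * tensor_scale).reshape(x.shape)


if __name__ == "__main__":
    w = np.random.randn(4, 32).astype(np.float32)
    print("max abs error:", np.abs(w - quantize_nvfp4(w)).max())
```

Note that round-to-nearest, as used above, is biased when applied to gradients; the "unbiased" quantizer named in the abstract presumably swaps it for something like stochastic rounding, which this sketch omits.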
Related papers
- Pretraining Large Language Models with NVFP4 [53.235038214986865]
We introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates a two-dimensional quantization scheme for consistent representations across both the forward and backward passes. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline.
arXiv Detail & Related papers (2025-09-29T17:53:17Z)
- Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization [77.67818998672516]
We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization. We introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm. We show that MR-GPTQ matches or exceeds state-of-the-art accuracy.
arXiv Detail & Related papers (2025-09-27T09:22:21Z)
- FP4 All the Way: Fully Quantized Training of LLMs [26.195547788434908]
We demonstrate fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision. We investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods.
arXiv Detail & Related papers (2025-05-25T12:14:25Z)
- Quartet: Native FP4 Training Can Be Optimal for Large Language Models [27.800012997794987]
Training large language models (LLMs) directly in low precision offers a way to address computational costs. NVIDIA's recent Blackwell architecture facilitates very low-precision operations using FP4 variants. We introduce a new approach for accurate, end-to-end FP4 training with all the major computations in low precision.
arXiv Detail & Related papers (2025-05-20T17:55:50Z)
- Oscillation-Reduced MXFP4 Training for Vision Transformers [19.642508885867375]
Pre-training Transformers in FP4 precision comes with a considerable loss of accuracy. Training with the MXFP4 data format still results in significant degradation. We propose a novel training method, TetraJet, for more accurate FP4 training.
arXiv Detail & Related papers (2025-02-28T08:51:55Z)
- Towards Efficient Pre-training: Exploring FP4 Precision in Large Language Models [25.700481606604647]
Experimental results demonstrate that our FP4 training scheme achieves accuracy comparable to BF16 and FP8, with a smaller theoretical computational cost. With the advent of next-generation hardware supporting FP4, our method sets the foundation for efficient ultra-low precision training.
arXiv Detail & Related papers (2025-02-17T05:33:11Z)
- Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z)
- HALO: Hadamard-Assisted Lower-Precision Optimization for LLMs [48.55966021231297]
We present HALO, a novel quantization-aware training approach for Transformers. Our approach ensures that all large matrix multiplications during the forward and backward passes are executed in lower precision. Applied to LLAMA-family models, HALO achieves results nearly equivalent to full precision during fine-tuning on various tasks.
arXiv Detail & Related papers (2025-01-05T18:41:54Z)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit (a generic sketch of this mixed-precision idea follows the list below).
We provide GPU kernels matching the QUIK format with highly efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
- FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training [62.932299614630985]
We propose FracTrain, which integrates progressive fractional quantization that gradually increases the precision of activations, weights, and gradients. FracTrain reduces the computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better (-0.12% to +1.87%) accuracy.
arXiv Detail & Related papers (2020-12-24T05:24:10Z)
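Both TetraJet-v2's OutControl and QUIK's hybrid scheme above revolve around keeping a small set of outliers at higher precision while quantizing the bulk to 4 bits. The sketch below illustrates only that general idea; the column-norm selection rule, the 1% outlier fraction, and the symmetric integer grid are assumptions for illustration, not the published algorithms.

```python
import numpy as np


def split_outliers_and_quantize(w, outlier_fraction=0.01, n_bits=4):
    """Illustrative outlier-splitting quantization (not the published QUIK /
    OutControl algorithms): keep the highest-magnitude columns in full
    precision and round everything else to a symmetric low-bit grid.
    """
    w = np.asarray(w, dtype=np.float32)
    n_cols = w.shape[1]
    n_outlier = max(1, int(round(outlier_fraction * n_cols)))

    # Pick outlier columns by their maximum absolute value (an assumption;
    # real methods may use activation statistics or rotations instead).
    col_amax = np.abs(w).max(axis=0)
    outlier_cols = np.argsort(col_amax)[-n_outlier:]
    dense_cols = np.setdiff1d(np.arange(n_cols), outlier_cols)

    # Quantize the remaining columns per-column to a symmetric int grid.
    q_max = 2 ** (n_bits - 1) - 1                      # e.g. 7 for 4 bits
    dense = w[:, dense_cols]
    scale = np.abs(dense).max(axis=0, keepdims=True) / q_max + 1e-12
    dense_q = np.clip(np.round(dense / scale), -q_max, q_max) * scale

    # Reassemble: low-bit columns plus the untouched outlier columns.
    w_hat = w.copy()
    w_hat[:, dense_cols] = dense_q
    return w_hat, outlier_cols


if __name__ == "__main__":
    w = np.random.randn(64, 256).astype(np.float32)
    w[:, 7] *= 50.0                 # plant an artificial outlier column
    w_hat, kept = split_outliers_and_quantize(w)
    print("outlier columns kept in full precision:", kept)
    print("mean abs quantization error:", np.abs(w - w_hat).mean())
```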