MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling
- URL: http://arxiv.org/abs/2511.05811v1
- Date: Sat, 08 Nov 2025 02:51:26 GMT
- Title: MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling
- Authors: Yu Zhang, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu
- Abstract summary: Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. We propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large language models with FP8 formats offers significant efficiency gains. However, the reduced numerical precision of FP8 poses challenges for stable and accurate training. Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of matrix multiplication, introducing additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to dynamically adjust scaling factors based on the current data distribution. However, this online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability. MOSS introduces two key innovations: (1) a two-level microscaling strategy for quantizing sensitive activations, which balances precision and dequantization cost by combining a high-precision global scale with compact, power-of-two local scales; and (2) automatic scaling for weights in linear layers, which eliminates the need for costly max-reduction operations by predicting and adjusting scaling factors during training. Leveraging these techniques, MOSS enables efficient FP8 training of a 7B-parameter model, matching the performance of the BF16 baseline while delivering up to 34% higher training throughput.
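The two mechanisms named in the abstract are concrete enough to sketch. Below is a minimal NumPy illustration of the two-level microscaling idea, assuming an E4M3 FP8 range and a group size of 32; these constants and all function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

FP8_MAX = 448.0   # max magnitude of FP8 E4M3 (assumed format)
GROUP_SIZE = 32   # illustrative microscaling group size

def quantize_two_level(x: np.ndarray):
    """Sketch of two-level microscaling: one high-precision (FP32) global
    scale per tensor plus a compact power-of-two local scale per group."""
    assert x.size % GROUP_SIZE == 0
    g = x.reshape(-1, GROUP_SIZE)
    # Level 1: global FP32 scale maps the tensor's absolute max to the FP8 range.
    global_scale = max(np.abs(g).max(), 1e-12) / FP8_MAX
    # Level 2: per-group power-of-two scales absorb residual range differences;
    # storing only the exponent keeps them compact.
    group_max = np.abs(g).max(axis=1, keepdims=True)
    ratio = group_max / (global_scale * FP8_MAX)
    local_exp = np.ceil(np.log2(np.maximum(ratio, 2.0 ** -126)))
    # Clipping stands in for the actual cast to FP8 values.
    q = np.clip(g / (global_scale * 2.0 ** local_exp), -FP8_MAX, FP8_MAX)
    return q, global_scale, local_exp

def dequantize_two_level(q, global_scale, local_exp):
    # With power-of-two local scales, dequantization is a cheap exponent
    # adjustment on top of a single global multiply.
    return q * global_scale * 2.0 ** local_exp
```

The second mechanism, automatic scaling for weights, can be read as a predict-and-adjust loop: instead of a max-reduction over the weight tensor at every step, predict the scale from a running estimate of the weight amax and correct it only when the FP8 cast saturates. The class below is a hypothetical sketch of that reading; the update rules and constants are not the paper's stated algorithm.

```python
class AutoWeightScale:
    """Hypothetical predict-and-adjust scaler for linear-layer weights."""

    def __init__(self, init_amax: float, momentum: float = 0.9,
                 margin: float = 1.25):
        self.amax_est = init_amax  # running estimate of the weight amax
        self.momentum = momentum
        self.margin = margin       # headroom for small per-step weight drift

    def scale(self) -> float:
        # Predicted scale: no max-reduction over the weight tensor needed.
        return (self.amax_est * self.margin) / 448.0  # E4M3 max, assumed

    def adjust(self, saturated: bool, measured_amax: float | None = None):
        # Cheap feedback: grow the estimate on saturation; when an exact
        # amax is available anyway (e.g. fused into another kernel), fold
        # it into the running estimate.
        if saturated:
            self.amax_est *= 2.0
        elif measured_amax is not None:
            self.amax_est = (self.momentum * self.amax_est
                             + (1.0 - self.momentum) * measured_amax)
```

In this reading, the power-of-two local scales keep dequantization off the critical path of the matmul, and the predictive weight scaler removes the online max-reduction that the abstract identifies as the main cost of just-in-time scaling.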
Related papers
- FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning [12.855945066222743]
This report presents a practical FP8 rollout stack for large language models (LLMs). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to the KV-cache to remove long-context memory bottlenecks, and (iii) mitigate the resulting mismatch using importance-based rollout correction. Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
arXiv Detail & Related papers (2026-01-26T05:12:05Z)
- Pretraining Large Language Models with NVFP4 [53.235038214986865]
We introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates a two-dimensional quantization scheme for consistent representations across both the forward and backward passes. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline.
arXiv Detail & Related papers (2025-09-29T17:53:17Z)
- PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
- Towards Fully FP8 GEMM LLM Training at Scale [77.97607456493257]
Existing approaches often rely on suboptimal fine-grained FP8 kernels or fall back to higher-precision matrix multiplications. We introduce a new class of LLM architectures that, for the first time, support FP8 computation for all GEMMs within transformer blocks during both forward and backward passes. This enables unprecedented throughput gains, particularly at scale, while matching the downstream performance of standard BF16 training.
arXiv Detail & Related papers (2025-05-26T21:04:14Z)
- Towards Efficient Pre-training: Exploring FP4 Precision in Large Language Models [25.700481606604647]
Experimental results demonstrate that our FP4 training scheme achieves accuracy comparable to BF16 and FP8, with a smaller theoretical computational cost. With the advent of next-generation hardware supporting FP4, our method lays the foundation for efficient ultra-low-precision training.
arXiv Detail & Related papers (2025-02-17T05:33:11Z)
- Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z)
- Scaling Laws for Floating Point Quantization Training [47.174957621592775]
This paper explores the effects of FP quantization targets, exponent bits, mantissa bits, and the calculation of the scaling factor on the FP quantization training performance of LLMs. We provide the optimal exponent-mantissa bit ratio for different bit widths, which can serve as a reference for hardware manufacturers.
arXiv Detail & Related papers (2025-01-05T02:30:41Z)
- COAT: Compressing Optimizer States and Activations for Memory-Efficient FP8 Training [47.07768822212081]
COAT (Compressing Optimizer States and Activations for FP8 Training) is a novel FP8 training framework designed to significantly reduce the memory footprint of training large models. COAT reduces end-to-end training memory footprint by 1.54x and achieves a 1.43x end-to-end training speedup compared to BF16.
arXiv Detail & Related papers (2024-10-25T05:59:30Z)
- ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats [25.543571445739936]
This study explores the viability of floating-point (FP) quantization for large language models (LLMs).
For LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models possessing parameters beyond one billion.
For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
arXiv Detail & Related papers (2023-07-19T06:58:03Z)