Related papers: FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error

URL: http://arxiv.org/abs/2511.02302v1
Date: Tue, 04 Nov 2025 06:36:59 GMT
Title: FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
Authors: Fengjuan Wang, Zhiyi Su, Xingzhu Hu, Cheng Wang, Mou Sun,
Abstract summary: Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. We propose FP8-Flow-MoE, a training recipe featuring a quantization-consistent FP8-centric dataflow with a scaling-aware computation and fused FP8 operators.
Score: 3.281844093101284
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computation and reduce memory footprint, existing implementations still rely on BF16-dominated dataflows with frequent quantize-dequantize (Q/DQ) conversions. These redundant casts erode much of FP8's theoretical efficiency. However, naively removing these casts by keeping dataflows entirely in FP8 introduces double quantization error: tensors quantized along different dimensions accumulate inconsistent scaling factors, degrading numerical stability. We propose FP8-Flow-MoE, an FP8 training recipe featuring a quantization-consistent FP8-centric dataflow with a scaling-aware transpose and fused FP8 operators that streamline computation and reduce explicit cast operations from 12 to 2. Evaluations on a 671B-parameter MoE model demonstrate up to 21% higher throughput and 16.5 GB lower memory usage per GPU compared to BF16 and naïve FP8 baselines, while maintaining stable convergence. We provide a plug-and-play FP8 recipe compatible with TransformerEngine and Megatron-LM, which will be open-sourced soon.
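
Illustration (not from the paper): the double quantization error described in the abstract can be reproduced in a small NumPy sketch. Everything below is an assumption made for illustration: `fake_fp8_e4m3` is a hypothetical, simplified stand-in for E4M3 rounding (3 explicit mantissa bits, clamp at 448, no subnormal or NaN handling), and `quant_dequant` applies one scale per row or per column. The sketch does not implement the paper's scaling-aware transpose; it only compares quantizing a tensor column-wise once against re-quantizing a tensor that was already quantized row-wise.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite E4M3 magnitude


def fake_fp8_e4m3(x):
    """Round to a simplified E4M3-like grid (hypothetical helper).

    Keeps 4 significant bits (1 implicit + 3 explicit) and clamps to +/-448.
    Subnormals, NaN, and the minimum exponent are ignored.
    """
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    mantissa, exponent = np.frexp(x)            # x = mantissa * 2**exponent
    mantissa = np.round(mantissa * 16.0) / 16.0  # 0.5 <= |mantissa| < 1
    return np.ldexp(mantissa, exponent)


def quant_dequant(x, axis):
    """Quantize with one scale per slice along `axis`, then dequantize."""
    amax = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = FP8_E4M3_MAX / np.maximum(amax, 1e-12)
    return fake_fp8_e4m3(x * scale) / scale


rng = np.random.default_rng(0)
# Rows with very different magnitudes, so row and column scales disagree.
x = rng.standard_normal((64, 64)) * np.exp(rng.standard_normal((64, 1)))

row_q = quant_dequant(x, axis=1)            # row-wise scales
col_direct = quant_dequant(x, axis=0)       # column-wise, from the original
col_double = quant_dequant(row_q, axis=0)   # column-wise, from row-quantized

err_single = np.abs(col_direct - x).mean()
err_double = np.abs(col_double - x).mean()
print(f"single quantization mean abs error: {err_single:.6f}")
print(f"double quantization mean abs error: {err_double:.6f}")
```

In this toy setting each rounding step contributes a relative error of at most 1/16 per element, so the re-quantized tensor can carry roughly twice the worst-case error of the directly quantized one. Comparing the two printed values on your own data shows how much the second cast costs.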

Related papers

FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning [12.855945066222743]
This report presents a practical FP8 rollout stack for large language models (LLMs). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to KV-cache to remove long-context memory bottlenecks, and (iii) mitigate mismatch using importance-based rollout correction. Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
arXiv Detail & Related papers (2026-01-26T05:12:05Z)
MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling [29.545879706181974]
Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. We propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability.
arXiv Detail & Related papers (2025-11-08T02:51:26Z)
FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic [9.192731482247103]
Low-bit floating-point (FP) formats, such as FP8, provide significant acceleration and memory savings in model training. We propose FALQON, a novel framework that eliminates the quantization overhead from separate low-rank adaptation (LoRA) computational paths. FALQON achieves approximately a 3× training speedup over existing quantized LoRA methods with a similar level of accuracy.
arXiv Detail & Related papers (2025-10-28T04:44:49Z)
Speeding Up MACE: Low-Precision Tricks for Equivarient Force Fields [51.95157731126864]
Machine-learning force fields can deliver accurate molecular dynamics (MD) at high computational cost. This thesis aims to make MACE cheaper and faster by identifying computational bottlenecks and evaluating low-precision execution policies.
arXiv Detail & Related papers (2025-10-23T14:02:34Z)
Towards Fully FP8 GEMM LLM Training at Scale [77.97607456493257]
Existing approaches often rely on suboptimal fine-grained FP8 kernels or fall back to higher-precision matrix multiplications. We introduce a new class of LLM architectures that, for the first time, support FP8 computation for all GEMMs within transformer blocks during both forward and backward passes. This enables unprecedented throughput gains, particularly at scale, while matching the downstream performance of standard BF16 training.
arXiv Detail & Related papers (2025-05-26T21:04:14Z)
An Inquiry into Datacenter TCO for LLM Inference with FP8 [18.01919466758935]
We analyze the computational characteristics of large language model (LLM) inference from a TCO perspective. We investigate key workload characteristics influencing TCO for AI accelerators from Intel (Gaudi 2 & 3) and NVIDIA (H100 & H200). We find that Gaudi HPUs achieve superior utilization on thin GEMMs compared to their counterparts, especially in FP8-quantized models.
arXiv Detail & Related papers (2025-02-03T05:26:22Z)
Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z)
FP8-BERT: Post-Training Quantization for Transformer [20.51143486483669]
Transformer-based models, such as BERT, incur massive memory and inference costs when deployed in production. The new FP8 numeric format has been proposed and is supported on commercial AI computing platforms such as the H100. We empirically validate the effectiveness of FP8 as a way to do Post-Training Quantization without significant loss of accuracy.
arXiv Detail & Related papers (2023-12-10T02:14:34Z)
FP8-LM: Training FP8 Large Language Models [47.17804713425323]
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models. Experimental results show that, when training a GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
arXiv Detail & Related papers (2023-10-27T17:59:51Z)
Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization [0.0]
Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) are deployed on a state-of-the-art MicroController Unit (MCU). We propose an optimized software pipeline interleaving parallel computation of LSTM or GRU recurrent blocks with manually-managed memory transfers. Experiments are conducted on multiple LSTM and GRU based SE models trained on the Valentini dataset, featuring up to 1.24M parameters.
arXiv Detail & Related papers (2022-10-14T10:32:05Z)
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers. A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.