FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning
- URL: http://arxiv.org/abs/2601.18150v1
- Date: Mon, 26 Jan 2026 05:12:05 GMT
- Title: FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning
- Authors: Zhaopeng Qiu, Shuang Yu, Jingqi Zhang, Shuai Zhang, Xue Huang, Jingyi Yang, Junjie Lai,
- Abstract summary: This report presents a practical FP8 rollout stack for large language models (LLMs). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to the KV cache to remove long-context memory bottlenecks, and (iii) mitigate train-inference mismatch using importance-sampling-based rollout correction. Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) for large language models (LLMs) is increasingly bottlenecked by rollout (generation), where long output sequence lengths make attention and KV-cache memory dominate end-to-end step time. FP8 offers an attractive lever for accelerating RL by reducing compute cost and memory traffic during rollout, but applying FP8 in RL introduces unique engineering and algorithmic challenges: policy weights change every step (requiring repeated quantization and weight synchronization into the inference engine) and low-precision rollouts can deviate from the higher-precision policy assumed by the trainer, causing train-inference mismatch and potential instability. This report presents a practical FP8 rollout stack for LLM RL, implemented in the veRL ecosystem with support for common training backends (e.g., FSDP/Megatron-LM) and inference engines (e.g., vLLM/SGLang). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to KV-cache to remove long-context memory bottlenecks via per-step QKV scale recalibration, and (iii) mitigate mismatch using importance-sampling-based rollout correction (token-level TIS/MIS variants). Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
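As a concrete illustration of technique (i), blockwise FP8 quantization can be simulated in plain NumPy: weights are split into fixed-size blocks, each block gets its own scale mapping its maximum magnitude onto the FP8 e4m3 range, and values are rounded to the 4-bit-significand grid. This is a minimal sketch under stated assumptions (128-wide blocks, no subnormal or NaN handling), not the paper's actual kernel:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in e4m3

def round_to_e4m3_grid(x: np.ndarray) -> np.ndarray:
    """Round to the nearest e4m3-representable value (subnormals/NaN ignored)."""
    mant, exp = np.frexp(x)                # x = mant * 2**exp, |mant| in [0.5, 1)
    mant = np.round(mant * 16.0) / 16.0    # 1 implicit + 3 explicit significand bits
    return np.clip(np.ldexp(mant, exp), -FP8_E4M3_MAX, FP8_E4M3_MAX)

def quantize_dequantize_blockwise(w: np.ndarray, block: int = 128) -> np.ndarray:
    """Blockwise FP8 round-trip: one scale per `block` consecutive weights."""
    rows, cols = w.shape
    assert cols % block == 0, "illustration assumes cols divisible by block"
    blocks = w.reshape(rows, cols // block, block)
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    scale = np.where(amax > 0, amax / FP8_E4M3_MAX, 1.0)   # per-block scale
    q = round_to_e4m3_grid(blocks / scale)                 # simulated FP8 cast
    return (q * scale).reshape(rows, cols)                 # dequantized view
```

Because each block's maximum magnitude maps onto the full FP8 range, the round-trip error stays within the e4m3 relative precision (roughly 2^-4) per element, regardless of how magnitudes vary across blocks.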
Related papers
- QuRL: Efficient Reinforcement Learning with Quantized Rollout
Reinforcement learning with verifiable rewards (RLVR) has become a trending paradigm for training reasoning large language models (LLMs). Due to the autoregressive decoding nature of LLMs, the rollout process becomes the efficiency bottleneck of RL training, accounting for up to 70% of the total training time. We propose Quantized Reinforcement Learning (QuRL), which uses a quantized actor to accelerate the rollout.
arXiv Detail & Related papers (2026-02-15T01:48:10Z) - Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow
We present the first comprehensive study of FP8 RL training. We propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization.
arXiv Detail & Related papers (2026-01-20T18:54:31Z) - MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling
Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. We propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability.
arXiv Detail & Related papers (2025-11-08T02:51:26Z) - FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. We propose FP8-Flow-MoE, a training recipe featuring a quantization-consistent FP8-centric dataflow with scaling-aware computation and fused FP8 operators.
arXiv Detail & Related papers (2025-11-04T06:36:59Z) - QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA). Experiments demonstrate that QeRL delivers over 1.5x speedup in the rollout phase.
arXiv Detail & Related papers (2025-10-13T17:55:09Z) - Pretraining Large Language Models with NVFP4
We introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates a two-dimensional quantization scheme for consistent representations across both the forward and backward passes. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline.
arXiv Detail & Related papers (2025-09-29T17:53:17Z) - Towards Fully FP8 GEMM LLM Training at Scale
Existing approaches often rely on suboptimal fine-grained FP8 kernels or fall back to higher-precision matrix multiplications. We introduce a new class of LLM architectures that, for the first time, support FP8 computation for all GEMMs within transformer blocks during both forward and backward passes. This enables unprecedented throughput gains, particularly at scale, while matching the downstream performance of standard BF16 training.
arXiv Detail & Related papers (2025-05-26T21:04:14Z) - Optimizing Large Language Model Training Using FP4 Quantization
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z) - FP8-LM: Training FP8 Large Language Models
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models.
Experimental results show that, during training of a GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
arXiv Detail & Related papers (2023-10-27T17:59:51Z)
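The importance-sampling rollout correction that FP8-RL applies (and that quantized-rollout setups like QuRL's generally need) reduces, in its truncated token-level (TIS) form, to reweighting each token's loss by a clipped likelihood ratio between the trainer policy and the rollout policy. A minimal sketch, assuming per-token log-probabilities are available from both engines; the function name and clip threshold are illustrative, not taken from any of the papers above:

```python
import numpy as np

def tis_weights(logp_train: np.ndarray, logp_rollout: np.ndarray,
                clip: float = 2.0) -> np.ndarray:
    """Token-level truncated importance-sampling (TIS) weights.

    Corrects for the low-precision rollout policy deviating from the
    higher-precision trainer policy: each token's loss term is scaled by
    min(pi_train / pi_rollout, clip), computed from per-token log-probs.
    """
    ratio = np.exp(np.asarray(logp_train) - np.asarray(logp_rollout))
    return np.minimum(ratio, clip)  # truncation bounds the variance

# When rollout and trainer agree the weight is 1 and the update is unchanged;
# large mismatches are capped at `clip` instead of inflating the gradient.
```

The truncation is the usual bias-variance trade-off: an unclipped ratio would make the estimator unbiased but let rare, badly mismatched tokens dominate the gradient.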
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.