Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow
- URL: http://arxiv.org/abs/2601.14243v2
- Date: Sat, 24 Jan 2026 05:17:44 GMT
- Title: Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow
- Authors: Haocheng Xi, Charlie Ruan, Peiyuan Liao, Yujun Lin, Han Cai, Yilong Zhao, Shuo Yang, Kurt Keutzer, Song Han, Ligeng Zhu
- Abstract summary: We present the first comprehensive study of FP8 RL training. We propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization.
- Score: 48.48936574810267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) is essential for enhancing the complex reasoning capabilities of large language models (LLMs). However, existing RL training pipelines are computationally inefficient and resource-intensive, with the rollout phase accounting for over 70% of total training time. Quantized RL training, particularly using FP8 precision, offers a promising approach to mitigating this bottleneck. A commonly adopted strategy applies FP8 precision during rollout while retaining BF16 precision for training. In this work, we present the first comprehensive study of FP8 RL training and demonstrate that the widely used BF16-training + FP8-rollout strategy suffers from severe training instability and catastrophic accuracy collapse under long-horizon rollouts and challenging tasks. Our analysis shows that these failures stem from the off-policy nature of the approach, which introduces substantial numerical mismatch between training and inference. Motivated by these observations, we propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization. The key idea is to adopt a unified FP8 precision flow for both training and rollout, thereby minimizing numerical discrepancies and eliminating the need for inefficient inter-step calibration. Extensive experiments validate the effectiveness of Jet-RL: our method achieves up to 33% speedup in the rollout phase, up to 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence across all settings and incurring negligible accuracy degradation.
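To make the quantization step concrete, below is a minimal sketch, in PyTorch, of blockwise FP8 (e4m3) fake quantization, the kind of rounding that a unified FP8 precision flow would apply identically in both training and rollout. This is not the authors' code; the function name, the block size of 128, and the choice of the e4m3 format are illustrative assumptions.

```python
# A minimal sketch (not Jet-RL's implementation) of blockwise FP8 e4m3
# fake quantization: one scale per block of contiguous values, rounded
# through a real FP8 dtype and dequantized back.
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in e4m3

def quantize_dequantize_fp8(x: torch.Tensor, block: int = 128) -> torch.Tensor:
    """Round x through FP8 e4m3 with one scale per `block` contiguous values.

    Assumes x.numel() is divisible by `block` (sufficient for a sketch).
    """
    orig_shape = x.shape
    x = x.reshape(-1, block)
    # Per-block scale so the largest element in each block maps to the FP8 max.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    # Cast to a real FP8 dtype and back; this is where precision is lost.
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return (x_fp8.to(torch.float32) * scale).reshape(orig_shape)

w = torch.randn(4, 256)
w_q = quantize_dequantize_fp8(w)
print((w - w_q).abs().max())  # small but nonzero rounding error
```

The point the paper makes is about where this rounding happens: if only the rollout path rounds like this while training stays in BF16, the policy that generated the data and the policy being optimized disagree numerically, which is the off-policy mismatch described above.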
Related papers
- FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning [12.855945066222743]
This report presents a practical FP8 rollout stack for large language models (LLMs). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to the KV-cache to remove long-context memory bottlenecks, and (iii) mitigate train-rollout mismatch using importance-based rollout correction (a generic sketch of this idea appears after this list). Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
arXiv Detail & Related papers (2026-01-26T05:12:05Z)
- QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs [80.76334908639745]
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). QeRL combines NVFP4 quantization with Low-Rank Adaptation (LoRA). Experiments demonstrate that QeRL delivers over a 1.5x speedup in the rollout phase.
arXiv Detail & Related papers (2025-10-13T17:55:09Z)
- Pretraining Large Language Models with NVFP4 [53.235038214986865]
We introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates a two-dimensional quantization scheme for consistent representations across both the forward and backward passes. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline.
arXiv Detail & Related papers (2025-09-29T17:53:17Z)
- InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models [34.21089641502727]
We introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.
arXiv Detail & Related papers (2025-09-26T16:16:49Z)
- Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [65.14124923451077]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing. We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z)
- Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z)
- Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs [4.5440077473497364]
Large Language Models (LLMs) have attracted significant attention due to their human-like language understanding and generation capabilities.
These models, characterized by their massive scale and extensive training data, continue to push the boundaries of what is possible in natural language processing.
The immense computational demands associated with training such models have spurred ongoing research into optimizing the efficiency of the training process.
arXiv Detail & Related papers (2024-11-10T15:19:42Z)
- To FP8 and Back Again: Quantifying Reduced Precision Effects on LLM Training Stability [7.115739465137031]
BrainFloat16 (BF16) precision has become the de facto standard for large language model pretraining. However, prior experience with FP16, which was found to be less stable than BF16, raises concerns as to whether FP8 can be a cost-effective option for LLM training. We propose new evaluation techniques and a new metric for quantifying loss landscape sharpness in autoregressive language models.
arXiv Detail & Related papers (2024-05-29T02:42:23Z)
- FP8-LM: Training FP8 Large Language Models [47.17804713425323]
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models.
Experiment results show that, during training of a GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
arXiv Detail & Related papers (2023-10-27T17:59:51Z)
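As referenced in the FP8-RL entry above, one generic way to handle the numerical gap between a low-precision rollout policy and a higher-precision training policy is to reweight the policy gradient by a clipped importance ratio. The sketch below illustrates that general idea; it is not the implementation from FP8-RL or Jet-RL, and the function name, clip value, and REINFORCE-style loss are illustrative assumptions.

```python
# A minimal sketch of importance-based rollout correction: when the rollout
# policy (e.g., FP8) differs numerically from the training policy (e.g., BF16),
# weight each token's policy-gradient term by a truncated importance ratio.
import torch

def corrected_pg_loss(train_logp: torch.Tensor,
                      rollout_logp: torch.Tensor,
                      advantages: torch.Tensor,
                      clip: float = 2.0) -> torch.Tensor:
    """Token-level REINFORCE loss with truncated importance weights.

    train_logp:   log pi_train(a_t | s_t) under the training-precision model
    rollout_logp: log pi_rollout(a_t | s_t) recorded during generation
    advantages:   per-token advantage estimates
    """
    # Ratio of training to rollout probabilities; detached so it acts as a
    # fixed weight rather than a gradient path, and clipped for stability.
    ratio = (train_logp - rollout_logp).detach().exp().clamp(max=clip)
    return -(ratio * advantages * train_logp).mean()

# Toy usage with fake values standing in for real per-token log-probs:
t = torch.randn(8, requires_grad=True)
loss = corrected_pg_loss(t, t.detach() + 0.01 * torch.randn(8), torch.randn(8))
loss.backward()
```

Jet-RL's argument is that such inter-step corrections become unnecessary when training and rollout share one FP8 precision flow, since the two policies then agree numerically by construction.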