Fugu-MT Paper Translation (Abstract): APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation

Paper summary: APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation

arxiv url: http://arxiv.org/abs/2509.18521v3
Date: Fri, 26 Sep 2025 22:20:15 GMT
Status: Translation complete
Last updated in system: 2025-09-30 11:50:46.812811
Title: APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation
Title (reference translation): APRIL: Active Partial Rollouts in Reinforcement Learning for Long-tail Generation
Authors: Yuzhen Zhou, Jiajun Li, Yusheng Su, Gowtham Ramesh, Zilin Zhu, Xiang Long, Chenyang Zhao, Jin Pan, Xiaodong Yu, Ze Wang, Kangrui Du, Jialian Wu, Ximeng Sun, Jiang Liu, Qiaolin Yu, Hao Chen, Zicheng Liu, Emad Barsoum
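Abstract summary: Reinforcement learning (RL) has become a cornerstone in advancing large-scale pre-trained language models (LLMs). The paper proposes Active Partial Rollouts in Reinforcement Learning (APRIL). APRIL over-provisions rollout requests, terminates once the target number of responses is reached, and recycles incomplete responses for continuation in future steps.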
Reference score (independently computed attention measure): 40.120847511378365
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) has become a cornerstone in advancing large-scale pre-trained language models (LLMs). Successive generations, including GPT-o series, DeepSeek-R1, Kimi-K1.5, Grok 4, and GLM-4.5, have relied on large-scale RL training to enhance reasoning and coding capabilities. To meet the community's growing RL needs, numerous RL frameworks have been proposed. However, RL training remains computationally expensive, with rollout generation accounting for more than 90% of total runtime. In addition, its efficiency is often constrained by the long-tail distribution of rollout response lengths, where a few lengthy responses stall entire batches, leaving GPUs idle and underutilized. As model and rollout sizes continue to grow, this bottleneck increasingly limits scalability. To address this challenge, we propose Active Partial Rollouts in Reinforcement Learning (APRIL), which mitigates long-tail inefficiency. In the rollout phase, APRIL over-provisions rollout requests, terminates once the target number of responses is reached, and recycles incomplete responses for continuation in future steps. This strategy ensures that no rollouts are discarded while substantially reducing GPU idle time. Experiments show that APRIL improves rollout throughput by 22.5% on average (at most 44%) across commonly used RL algorithms (GRPO, DAPO, GSPO), accelerates convergence, and achieves 2.1% on average (at most 8%) higher final accuracy across tasks. Moreover, APRIL is both framework and hardware agnostic, already integrated into the slime RL framework, and deployable on NVIDIA and AMD GPUs alike. Taken together, this work unifies system-level and algorithmic considerations in proposing APRIL, with the aim of advancing RL training efficiency and inspiring further optimizations in RL systems. Our codebase is available at https://github.com/RLsys-Foundation/APRIL
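Abstract (reference translation): Reinforcement learning (RL) has become a foundation for progress in large-scale pre-trained language models (LLMs). Generations of models such as the GPT-o series, DeepSeek-R1, Kimi-K1.5, Grok 4, and GLM-4.5 rely on large-scale RL training to improve reasoning and coding ability. Many RL frameworks have been proposed to meet the community's RL needs. RL training is nevertheless computationally expensive, with rollout generation accounting for more than 90% of total runtime. Its efficiency is also often limited by the long-tail distribution of rollout response lengths: a few long responses stall an entire batch and leave GPUs idle and underutilized. As model and rollout sizes grow, this bottleneck increasingly limits scalability. To address this, the authors propose Active Partial Rollouts in Reinforcement Learning (APRIL), which mitigates long-tail inefficiency. In the rollout phase, APRIL over-provisions rollout requests, terminates once the target number of responses is reached, and recycles incomplete responses for continuation in future steps. This strategy ensures that no rollouts are discarded while substantially reducing GPU idle time. Experiments show that APRIL improves rollout throughput by 22.5% on average (at most 44%) across commonly used RL algorithms (GRPO, DAPO, GSPO), accelerates convergence, and raises final accuracy across tasks by 2.1% on average (at most 8%). APRIL is also framework and hardware agnostic: it is already integrated into the slime RL framework and can be deployed on both NVIDIA and AMD GPUs. The work unifies system-level and algorithmic considerations in proposing APRIL, aiming to improve RL training efficiency and inspire further optimization of RL systems. The codebase is available at https://github.com/RLsys-Foundation/APRIL.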
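The rollout strategy described in the abstract (over-provision requests, stop once the target number of responses is reached, recycle incomplete responses) can be illustrated with a toy scheduler. This is only a sketch under stated assumptions, not the authors' implementation or the slime API: the names `Rollout`, `make_queue`, `april_step`, and `sync_step` are hypothetical, and decoding is simplified so that every active request produces exactly one token per tick.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical record for one rollout request (illustration only)."""
    prompt_id: int
    total_len: int      # tokens needed to finish; a real scheduler does not know this
    generated: int = 0  # tokens produced so far, preserved when recycled

    @property
    def done(self) -> bool:
        return self.generated >= self.total_len


def make_queue(lengths):
    """Build a queue of fresh requests from a list of response lengths."""
    return deque(Rollout(i, n) for i, n in enumerate(lengths))


def sync_step(new_prompts, target):
    """Baseline: launch `target` requests and wait for every one to finish."""
    batch = [new_prompts.popleft() for _ in range(min(target, len(new_prompts)))]
    ticks = max((r.total_len for r in batch), default=0)  # the longest response sets the pace
    for r in batch:
        r.generated = r.total_len
    return batch, ticks


def april_step(buffer, new_prompts, target, oversample):
    """One over-provisioned rollout phase; returns (finished, decode_ticks)."""
    # Resume recycled partial rollouts first, then top up with fresh prompts.
    active = []
    while len(active) < oversample and buffer:
        active.append(buffer.popleft())
    while len(active) < oversample and new_prompts:
        active.append(new_prompts.popleft())

    finished, ticks = [], 0
    while len(finished) < target and active:
        ticks += 1
        for r in active:
            r.generated += 1  # simplification: one token per request per tick
        finished.extend(r for r in active if r.done)
        active = [r for r in active if not r.done]

    # Early termination: unfinished rollouts keep their progress and are
    # recycled into the buffer instead of being discarded.
    buffer.extend(active)
    return finished, ticks
```

In this sketch a phase may return slightly more than `target` responses when several requests finish on the same tick; all of them are kept, in line with the abstract's statement that no rollouts are discarded. Calling `april_step` repeatedly with the same `buffer` lets a long response accumulate tokens over several phases, whereas `sync_step` makes the whole batch wait for it.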


Related papers