Fugu-MT Paper Translation (Abstract): How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

Paper overview: How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

arxiv url: http://arxiv.org/abs/2605.21266v1
Date: Wed, 20 May 2026 14:53:43 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.73913
Title: How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR
Authors: Richa Verma, Balaraman Ravindran,
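Abstract summary: G2D (GRPO to DPO) is a three-stage pipeline that performs a short GRPO warm-up, constructs a static preference dataset, and fine-tunes a model offline with DPO. Offline DPO with moderate warm-up matches or outperforms GRPO at substantially lower compute cost. The results recast the offline-online gap in RLVR as primarily a data informativeness problem, and identify a short online RL warm-up with appropriate difficulty calibration of the fine-tuning dataset as a compute-efficient alternative to online RL.
Reference score (independently computed attention score): 7.0964309805625945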
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for reasoning in language models, with GRPO as its primary example. However, GRPO requires continuous online rollout generation, making it computationally expensive and difficult to scale. While Direct Preference Optimization (DPO) offers a stable and efficient offline alternative, it is typically expected to underperform w.r.t. online RL methods such as GRPO when trained on rollouts from a cold supervised fine-tuned (SFT) policy. We introduce G2D (GRPO to DPO), a three-stage pipeline that performs a short GRPO warm-up, constructs a static preference dataset, and fine-tunes a model offline with DPO. Across a set of values of the number of online steps (K) in GRPO on Qwen2.5-7B and Llama-3.1-8B, we find that offline DPO with moderate warm-up matches or outperforms GRPO at substantially lower compute cost in our setting. On Qwen2.5-7B, G2D at K=150 achieves 62.4% on MATH-500, outperforming GRPO (51.6%) by 10.8 percentage points at ~4x lower compute. On Llama-3.1-8B, G2D at K=500 achieves 49.4%, surpassing GRPO in our experimental setting. We show that performance is not governed by the number of preference pairs, which does not vary much w.r.t. K, but by their informativeness. Moderate warm-up produces rollouts with calibrated uncertainty, yielding stronger contrastive signal, while excessive warm-up leads to overconfident policies and less informative data. Our results recast the offline-online gap in RLVR as primarily a data informativeness problem, and identify short online RL warm-up with appropriate difficulty calibration of the fine-tuning dataset as a compute-efficient alternative to online RL.
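Illustrative sketch (not from the paper): the abstract does not say how the static preference dataset is built or how informativeness is measured, so the following Python is only one plausible reading of stages two and three. It assumes a binary verifiable reward per rollout, pairs correct responses against incorrect ones for the same prompt, and uses a hypothetical pass-rate window (`min_pass`, `max_pass`) as a stand-in for difficulty calibration. The names `Rollout`, `build_preference_pairs`, and `dpo_loss` are invented for this example; the loss itself is the standard per-pair DPO objective.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Rollout:
    prompt: str
    response: str
    correct: bool  # assumed: binary outcome of a verifiable reward check


def build_preference_pairs(rollouts, min_pass=0.2, max_pass=0.8):
    """Pair correct (chosen) with incorrect (rejected) rollouts per prompt.

    Prompts whose pass rate lies outside [min_pass, max_pass] are skipped.
    The window is a hypothetical proxy for "informative" prompts, not the
    paper's stated criterion.
    """
    by_prompt = {}
    for r in rollouts:
        by_prompt.setdefault(r.prompt, []).append(r)

    pairs = []
    for prompt, group in by_prompt.items():
        chosen = [r.response for r in group if r.correct]
        rejected = [r.response for r in group if not r.correct]
        pass_rate = len(chosen) / len(group)
        if not (min_pass <= pass_rate <= max_pass):
            continue  # all-correct or all-wrong prompts yield no contrast
        pairs.extend((prompt, c, r) for c, r in product(chosen, rejected))
    return pairs


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    margin compares the policy's log-ratio against the reference model
    for the chosen response versus the rejected one.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

In this reading, the rollouts passed to `build_preference_pairs` would be sampled once from the policy after K GRPO warm-up steps, and the resulting pairs would then be reused for offline training with `dpo_loss`, with the warm-up checkpoint presumably serving as the reference model.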

Related papers