Fugu-MT Paper Translation (Summary): Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead

Paper summary: Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead

arxiv url: http://arxiv.org/abs/2510.01624v1
Date: Thu, 02 Oct 2025 02:57:00 GMT
Status: Translation complete
System update date: 2025-10-03 16:59:20.96474
Title: Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead
Authors: Feiyang Kang, Michael Kuchnik, Karthik Padthe, Marin Vlastelica, Ruoxi Jia, Carole-Jean Wu, Newsha Ardalani
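Abstract summary: We examine whether high SFT scores translate to improved performance after RL. High SFT scores can be biased toward simpler or more homogeneous data and do not reliably predict subsequent RL gains or scaled-up post-training effectiveness. We study alternative metrics and identify generalization loss on held-out reasoning examples and Pass@large k performance as strong proxies for the RL outcome.
Reference score (independently computed attention score): 20.446287312285648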
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In post-training for reasoning Large Language Models (LLMs), the current state of practice trains LLMs in two independent stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR, shortened as "RL" below). In this work, we challenge whether high SFT scores translate to improved performance after RL. We provide extensive counter-examples where this is not true. We find high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness. In some cases, RL training on models with improved SFT performance could lead to a substantially worse outcome compared to RL on the base model without SFT. We study alternative metrics and identify generalization loss on held-out reasoning examples and Pass@large k performance as strong proxies for the RL outcome. We trained hundreds of models of up to 12B parameters with SFT and RLVR via GRPO and ran extensive evaluations on 7 math benchmarks with up to 256 repetitions, spending >1M GPU hours. Experiments include models from Llama3, Mistral-Nemo, Qwen3 and multiple state-of-the-art SFT/RL datasets. Compared to directly predicting from pre-RL performance, prediction based on generalization loss and Pass@large k achieves substantially higher precision, improving the $R^2$ coefficient and Spearman's rank correlation coefficient by up to 0.5 (2x). This provides strong utility for broad use cases. For example, in most experiments, we find that SFT training on unique examples for one epoch underperforms training on half as many examples for two epochs, either after SFT or after SFT-then-RL; with the same SFT budget, training only on short examples may lead to better SFT performance, but it often leads to a worse outcome after RL compared to training on examples with varying lengths. The evaluation tool will be open-sourced.
