Fugu-MT Paper Translation (Abstract): Debunk the Myth of SFT Generalization

Paper overview: Debunk the Myth of SFT Generalization

arxiv url: http://arxiv.org/abs/2510.00237v1
Date: Tue, 30 Sep 2025 20:01:09 GMT
Status: Translation complete
System update date: 2025-10-03 16:59:20.237383
Title: Debunk the Myth of SFT Generalization
Title (reference translation): The Myth of SFT Generalization
Authors: Xiaofeng Lin, Hejian Sang, Zhipeng Wang, Xuezhou Zhang
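Abstract summary: A prevailing view holds that supervised fine-tuning (SFT) fails to generalize, whereas reinforcement learning (RL) attains broader robustness. The paper shows that much of SFT's perceived failure stems from frozen-prompt artifacts. It also asks whether SFT can generalize to strictly harder tasks.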
Reference score (independently computed attention score): 13.700645417996412
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A prevailing view holds that supervised fine-tuning (SFT) memorizes training data and fails to generalize, whereas reinforcement learning (RL) attains broader robustness. We revisit this claim through a systematic evaluation on two decision-making benchmarks, Sokoban and General Points, and arrive at a different conclusion. We show that much of SFT's perceived failure stems from frozen-prompt artifacts: when trained on fixed instruction templates, SFT models cling to training semantics rather than adapting to new ones. Introducing prompt diversity during training breaks this shortcut and yields strong generalization to unseen instruction variants without harming in-distribution performance. Beyond instruction shifts, we ask whether SFT can generalize to strictly harder tasks. Here, chain-of-thought (CoT) supervision provides an algorithmic scaffold that markedly improves transfer to more difficult regimes, such as larger Sokoban grids with additional boxes and arithmetic with out-of-distribution values or five-card compositions that increase combinatorial complexity. Finally, combining prompt diversity with CoT achieves the best of both worlds: robust generalization across both instruction-variant and difficulty-variant settings, matching or surpassing RL baselines on our benchmarks while retaining SFT's simplicity and stability. These findings challenge the narrative that SFT is inherently inferior to RL and support a data-centric perspective: with appropriately curated demonstrations, vanilla SFT can generalize as strongly as RL. Code reproducing the results in the paper can be found at: https://github.com/XiaofengLin7/debunking-sft-generalization.
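Abstract (reference translation): A prevailing view holds that supervised fine-tuning (SFT) memorizes training data and fails to generalize, whereas reinforcement learning (RL) attains broader robustness. We revisit this claim through a systematic evaluation on two decision-making benchmarks, Sokoban and General Points, and reach a different conclusion. When trained on fixed instruction templates, SFT models cling to the training semantics rather than adapting to new ones. Introducing prompt diversity during training breaks this shortcut and yields strong generalization to unseen instruction variants without harming in-distribution performance. Beyond instruction shifts, we ask whether SFT can generalize to harder tasks. Here, chain-of-thought (CoT) supervision provides an algorithmic scaffold that markedly improves transfer to more difficult regimes, such as larger Sokoban grids with additional boxes, arithmetic with out-of-distribution values, or five-card compositions that increase combinatorial complexity. Finally, combining prompt diversity with CoT achieves robust generalization across both instruction-variant and difficulty-variant settings, matching or surpassing RL baselines on our benchmarks while retaining SFT's simplicity and stability. These findings challenge the narrative that SFT is inherently inferior to RL and support a data-centric perspective: with appropriately curated demonstrations, vanilla SFT can generalize as strongly as RL. Code reproducing the results in the paper is available at https://github.com/XiaofengLin7/debunking-sft-generalization.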
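The prompt-diversity idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that): the templates, field names, and the `build_sft_dataset` helper below are all hypothetical, and the card example is only assumed to resemble a General Points style task. The sketch shows one way to render each demonstration under several instruction templates so that the SFT data does not rely on a single frozen prompt.

```python
import random

# Hypothetical instruction templates. The paper's actual prompts are not
# given in the abstract, so these wordings are illustrative assumptions.
TEMPLATES = [
    "Use the cards {cards} to reach {target}. Face cards J, Q, K count as {face}.",
    "Cards: {cards}. Build an expression equal to {target}; treat J, Q, K as {face}.",
    "Target {target}. Given {cards}, with J/Q/K valued at {face}, write a valid equation.",
]


def render_prompts(example, templates, k, rng):
    """Render one example under k distinct templates chosen at random."""
    chosen = rng.sample(templates, k)  # sampling without replacement
    # str.format ignores unused keyword arguments such as "answer".
    return [template.format(**example) for template in chosen]


def build_sft_dataset(examples, templates, k=2, seed=0):
    """Pair every rendered prompt with the example's demonstration answer."""
    rng = random.Random(seed)  # seeded for reproducibility
    dataset = []
    for example in examples:
        for prompt in render_prompts(example, templates, k, rng):
            dataset.append({"prompt": prompt, "response": example["answer"]})
    return dataset


# Hypothetical demonstration: with J counted as 10, (10 + 2) * (5 - 3) = 24.
EXAMPLE = {
    "cards": "3, 5, J, 2",
    "target": 24,
    "face": "10",
    "answer": "(10 + 2) * (5 - 3) = 24",
}

if __name__ == "__main__":
    for row in build_sft_dataset([EXAMPLE], TEMPLATES, k=2):
        print(row["prompt"], "->", row["response"])
```

In this sketch the face-card rule is part of each example, not of the template, so the demonstration answer stays consistent with whatever instruction wording is drawn. Varying the rule itself would require regenerating the answer for each variant.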
