Fugu-MT Paper Translation (Abstract): Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions

Paper overview: Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions

arxiv url: http://arxiv.org/abs/2603.19335v1
Date: Thu, 19 Mar 2026 04:10:38 GMT
Status: Translation complete
Last updated in system: 2026-03-23 19:48:38.808722
Title: Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions
Title (reference translation): Do post-training algorithms actually differ? A controlled study across model scales uncovers scale-dependent ranking inversions
Authors: Xiaoyi Li
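Abstract summary: The paper presents a unified framework implementing 51 post-training algorithms on identical infrastructure. The study covers 8 algorithms across 4 model scales (0.5B to 7B), 3 evaluation domains, and a 20-variant DPO taxonomy. None of the 20 DPO variants significantly outperforms vanilla DPO after Bonferroni correction; the sole significant outlier, SimPO, performs worse.
Reference score (independently computed attention score): 1.6498361958317636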
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Post-training alignment has produced dozens of competing algorithms -- DPO, SimPO, KTO, GRPO, and others -- yet practitioners lack controlled comparisons to guide algorithm selection. We present OXRL, a unified framework implementing 51 post-training algorithms with identical infrastructure, enabling the first large-scale apples-to-apples evaluation. Our study spans 8 algorithms across 4 model scales (0.5B--7B), 3 evaluation domains, and a 20-variant DPO taxonomy (100 runs at 1.5B, 5 seeds each), totaling $\sim$240 training runs on H100 GPUs. Three headline findings emerge. (1)~Algorithm rankings are unstable across scale: at 1.5B, online RL (SGRPO) tops all methods at 58.0\%~$\pm$0.57 on GSM8K; by 7B, the worst small-scale method (SimPO) becomes the best (85.8\%), a complete ranking inversion driven by model scale rather than LoRA regularization (confirmed via 2$\times$2 factorial). (2)~Loss function modifications yield negligible gains: none of 20 DPO variants significantly outperform vanilla DPO after Bonferroni correction; the sole significant outlier, SimPO, is worse ($-$11.5~pp, $p < 10^{-4}$). (3)~Algorithm leverage is task-specific: the 19.3~pp GSM8K spread collapses to 0.54~pp on MATH ($36\times$) and 0.47~pp on general-domain benchmarks ($41\times$), confirming that algorithm choice matters primarily within the training distribution. These findings yield a hierarchy of leverage for practitioners: model scale (${\sim}$50~pp) $\gg$ training paradigm (${\sim}$10~pp) $\gg$ online vs.\ offline (${\sim}$9~pp) $\gg$ loss function (${\sim}$1~pp). We release all code, configs, and evaluation data as a living community benchmark.
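Abstract (reference translation): Post-training alignment has produced dozens of competing algorithms, including DPO, SimPO, KTO, and GRPO, yet practitioners lack controlled comparisons to guide algorithm selection. We present OXRL, a unified framework that implements 51 post-training algorithms on identical infrastructure, enabling the first large-scale apples-to-apples evaluation. The study spans 8 algorithms across 4 model scales (0.5B to 7B), 3 evaluation domains, and a 20-variant DPO taxonomy (100 runs at 1.5B, 5 seeds each), totaling about 240 training runs on H100 GPUs. Three headline findings emerge. (1) Algorithm rankings are unstable across scale: at 1.5B, online RL (SGRPO) tops all methods at 58.0% ± 0.57 on GSM8K; by 7B, the worst small-scale method (SimPO) becomes the best (85.8%), a complete ranking inversion driven by model scale rather than LoRA regularization (confirmed via a 2×2 factorial). (2) Loss function modifications yield negligible gains: none of the 20 DPO variants significantly outperforms vanilla DPO after Bonferroni correction; the sole significant outlier, SimPO, is worse (−11.5 pp, p < 10^-4). (3) Algorithm leverage is task-specific: the 19.3 pp spread on GSM8K collapses to 0.54 pp on MATH (36×) and 0.47 pp on general-domain benchmarks (41×), confirming that algorithm choice matters primarily within the training distribution. These findings yield a hierarchy of leverage for practitioners: model scale (about 50 pp) ≫ training paradigm (about 10 pp) ≫ online vs. offline (about 9 pp) ≫ loss function (about 1 pp). We release all code, configs, and evaluation data as a living community benchmark.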
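
As a quick sanity check on the arithmetic quoted in the abstract, the short Python sketch below recomputes the two spread-collapse ratios (36× and 41×) and the Bonferroni-corrected significance threshold for 20 comparisons. The spreads and the SimPO p-value bound are copied from the abstract; the family-wise significance level `ALPHA = 0.05` is an assumption, since the abstract does not state it, and all variable and function names are illustrative rather than taken from OXRL.

```python
# Spreads in percentage points (pp), copied from the abstract.
gsm8k_spread = 19.3      # in-distribution benchmark (GSM8K)
math_spread = 0.54       # MATH
general_spread = 0.47    # general-domain benchmarks


def collapse_ratio(in_dist_spread, other_spread):
    """How many times smaller the spread is outside the training distribution."""
    return in_dist_spread / other_spread


math_ratio = collapse_ratio(gsm8k_spread, math_spread)
general_ratio = collapse_ratio(gsm8k_spread, general_spread)

# Bonferroni correction: divide the family-wise level by the number of tests.
ALPHA = 0.05             # ASSUMPTION: not stated in the abstract
N_VARIANTS = 20          # 20 DPO variants compared against vanilla DPO
bonferroni_threshold = ALPHA / N_VARIANTS

# The abstract reports p < 1e-4 for SimPO; use that upper bound.
simpo_p_upper_bound = 1e-4
simpo_significant = simpo_p_upper_bound < bonferroni_threshold

print(f"GSM8K vs MATH spread ratio:    {math_ratio:.1f}x")
print(f"GSM8K vs general spread ratio: {general_ratio:.1f}x")
print(f"Bonferroni per-test threshold: {bonferroni_threshold}")
print(f"SimPO significant after correction: {simpo_significant}")
```

Under the assumed level of 0.05, each of the 20 comparisons must reach p < 0.0025, which the reported SimPO bound of p < 10^-4 satisfies; the ratios round to the 36× and 41× quoted in the abstract.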


List of related papers