Fugu-MT Paper Translation (Abstract): Prefix-Guided On-Policy Distillation: Mining Golden Trajectories from Rollouts

Paper overview: Prefix-Guided On-Policy Distillation: Mining Golden Trajectories from Rollouts

arxiv url: http://arxiv.org/abs/2606.21994v1
Date: Sat, 20 Jun 2026 11:18:34 GMT
Status: Retrieving information
Last updated in system: 2026-06-23 15:02:26.553984
Title: Prefix-Guided On-Policy Distillation: Mining Golden Trajectories from Rollouts
Title (reference translation): Prefix-Guided On-Policy Distillation: Mining Golden Trajectories from Rollouts
Authors: Qingfei Zhao, Huan Song, Shuyu Tian, Jiawei Shao, Xuelong Li,
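Abstract summary: Prefix-Guided On-Policy Distillation (PG-OPD) is a simple rollout-allocation framework that uses fixed-length prefixes to estimate trajectory value before expensive long-horizon generation. Across diverse teacher-student combinations on AMC, AIME, and HMMT benchmarks, PG-OPD improves average accuracy by up to 4.80 points while reducing training time by up to 2.46x.
Reference score (independently computed attention level): 48.550535291129584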
License:
Abstract: On-policy distillation (OPD) improves reasoning models by applying dense teacher supervision on student-sampled trajectories. However, scaling OPD to long-horizon mathematical reasoning exposes a reliability and efficiency problem: standard OPD assigns every sampled candidate the same long rollout budget, even though some trajectories may quickly become weakly aligned with the teacher and provide less useful supervision. Prior analyses suggest that successful OPD depends on local teacher-student compatibility, which can be measured by top-k overlap on student-visited prefixes. When this overlap is low, continuing to generate or train on long suffixes may waste computation and introduce noisy learning signal. To address this, we introduce Prefix-Guided On-Policy Distillation (PG-OPD), a simple rollout-allocation framework that uses fixed-length prefixes to estimate trajectory value before expensive long-horizon generation. PG-OPD first decodes every sampled candidate to the same prefix length, computes teacher-student top-k overlap within an early probe window of that prefix, and selectively continues high-overlap candidates to a fixed long length. Low-overlap candidates stop at the fixed prefix, avoiding unnecessary suffix generation. Across diverse teacher-student combinations on AMC, AIME, and HMMT benchmarks, PG-OPD improves average accuracy by up to 4.80 points while reducing training time by up to 2.46x. These results suggest that prefix-level compatibility provides a practical signal for directing OPD computation toward trajectories that remain learnable from the teacher.
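Abstract (reference translation): On-policy distillation (OPD) improves reasoning models by applying dense teacher supervision to student-sampled trajectories. Standard OPD assigns every sampled candidate the same long rollout budget, even though some trajectories quickly become weakly aligned with the teacher and provide less useful supervision. Prior analyses suggest that successful OPD depends on local teacher-student compatibility, which can be measured by top-k overlap on student-visited prefixes. When this overlap is low, generating or training on long suffixes may waste computation and introduce a noisy learning signal. To address this, we introduce Prefix-Guided On-Policy Distillation (PG-OPD), a simple rollout-allocation framework that uses fixed-length prefixes to estimate trajectory value before expensive long-horizon generation. PG-OPD first decodes every sampled candidate to the same prefix length, computes teacher-student top-k overlap within an early probe window of that prefix, and selectively continues high-overlap candidates to a fixed long length. Low-overlap candidates stop at the fixed prefix, avoiding unnecessary suffix generation. Across diverse teacher-student combinations on AMC, AIME, and HMMT benchmarks, PG-OPD improves average accuracy by up to 4.80 points while reducing training time by up to 2.46x. These results suggest that prefix-level compatibility provides a practical signal for directing OPD computation toward trajectories that remain learnable from the teacher.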
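The prefix-probe step described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation. The function names, the normalization of overlap as shared tokens divided by k, the averaging over probe positions, and the fixed-threshold selection rule are all assumptions; the abstract does not specify them. Each next-token distribution is represented as a plain token-to-probability dictionary for simplicity.

```python
from typing import Dict, List, Sequence, Set

# One next-token distribution: token -> probability (assumed representation).
Distribution = Dict[str, float]


def top_k_tokens(dist: Distribution, k: int) -> Set[str]:
    """Return the k most probable tokens; ties are broken by token string."""
    ranked = sorted(dist.items(), key=lambda item: (-item[1], item[0]))
    return {token for token, _ in ranked[:k]}


def top_k_overlap(student: Distribution, teacher: Distribution, k: int) -> float:
    """Fraction of the top-k tokens shared by student and teacher at one position.

    Assumption: overlap is normalized as |intersection| / k.
    """
    return len(top_k_tokens(student, k) & top_k_tokens(teacher, k)) / k


def probe_score(
    student_dists: Sequence[Distribution],
    teacher_dists: Sequence[Distribution],
    k: int,
    probe_window: int,
) -> float:
    """Mean top-k overlap over the first `probe_window` prefix positions."""
    pairs = list(zip(student_dists, teacher_dists))[:probe_window]
    if not pairs:
        raise ValueError("probe window contains no positions")
    return sum(top_k_overlap(s, t, k) for s, t in pairs) / len(pairs)


def select_for_continuation(scores: Sequence[float], threshold: float) -> List[int]:
    """Indices of candidates whose probe score reaches the threshold.

    Assumption: a fixed threshold decides which candidates continue to the
    long length; the remaining candidates stop at the fixed prefix.
    """
    return [i for i, score in enumerate(scores) if score >= threshold]
```

In use, every candidate would first be decoded to the shared prefix length, `probe_score` would be computed for each one, and only the indices returned by `select_for_continuation` would receive the long rollout budget.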

Related papers