Fugu-MT Paper Translation (Abstract): On-Policy Distillation with Best-of-N Teacher Rollout Selection

Paper overview: On-Policy Distillation with Best-of-N Teacher Rollout Selection

arxiv url: http://arxiv.org/abs/2605.09725v2
Date: Wed, 13 May 2026 03:08:23 GMT
Status: Translation complete
Last updated in system: 2026-05-14 17:13:58.839858
Title: On-Policy Distillation with Best-of-N Teacher Rollout Selection
Title (reference translation): On-Policy Distillation with Best-of-N Teacher Rollout Selection
Authors: Ke Zhang, Yunjie Tian, Dongdi Zhao, Yijiang Li, Yuanye Liu, Vishal M Patel, Di Fu
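Abstract summary: This paper proposes BRTS, a Best-of-N Rollout Teacher Selection framework for on-policy distillation. BRTS augments standard student-context OPD with a teacher-context supervision branch constructed from a curated teacher trajectory. BRTS improves over standard OPD on challenging reasoning benchmarks, with the largest gains on harder datasets.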
Reference score (independently computed attention score): 54.91780727674628
License: http://creativecommons.org/licenses/by/4.0/
Abstract: On-policy distillation (OPD), which supervises a student on its own sampled trajectories, has emerged as a data-efficient post-training method for improving reasoning while avoiding the reward dependence of reinforcement learning and the catastrophic forgetting often observed in standard supervised fine-tuning. However, standard OPD typically computes teacher supervision under noisy student-generated contexts and often relies on a single stochastic teacher rollout per prompt. As a result, the supervision signal can be high-variance: the sampled teacher trajectory can be incorrect, uninformative, or poorly matched to the student's current reasoning behavior. To address this limitation, we propose BRTS, a Best-of-N Rollout Teacher Selection framework for on-policy distillation. BRTS augments standard student-context OPD with a teacher-context supervision branch constructed from the curated teacher trajectory. Rather than distilling from the first sampled teacher rollout, BRTS samples a small pool of teacher trajectories and selects the auxiliary trajectory using a simple priority rule: correctness first, student alignment second. When multiple correct teacher trajectories are available, BRTS chooses the one most aligned with the student's current behavior; when unconditioned teacher samples fail on harder prompts, it invokes a ground-truth-conditioned recovery step to elicit a natural derivation. The selected trajectory is then used to provide reliable teacher-context supervision inside the OPD loop, augmented with an auxiliary loss on the teacher trajectory. Experiments on AIME 2024, AIME 2025, and AMC 2023 show that BRTS improves over standard OPD on challenging reasoning benchmarks, with the largest gains on harder datasets. Our code is available at https://github.com/BWGZK-keke/BRTS.
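Abstract (reference translation): On-policy distillation (OPD) is a method that supervises a student on its own sampled trajectories. It has emerged as a data-efficient post-training method for improving reasoning while avoiding the reward dependence of reinforcement learning and the catastrophic forgetting often seen in standard supervised fine-tuning. However, standard OPD typically computes teacher supervision under noisy student-generated contexts and often relies on a single stochastic teacher rollout per prompt. As a result, the supervision signal can be high-variance: the sampled teacher trajectory can be incorrect, uninformative, or poorly matched to the student's current reasoning behavior. To address this limitation, the authors propose BRTS, a Best-of-N Rollout Teacher Selection framework for on-policy distillation. BRTS augments standard student-context OPD with a teacher-context supervision branch constructed from a curated teacher trajectory. Instead of distilling from the first sampled teacher rollout, BRTS samples a small pool of teacher trajectories and selects the auxiliary trajectory with a simple priority rule: correctness first, student alignment second. When multiple correct teacher trajectories are available, BRTS chooses the one most aligned with the student's current behavior; when unconditioned teacher samples fail on harder prompts, it invokes a ground-truth-conditioned recovery step to elicit a natural derivation. The selected trajectory is then used to provide reliable teacher-context supervision inside the OPD loop, together with an auxiliary loss on the teacher trajectory. Experiments on AIME 2024, AIME 2025, and AMC 2023 show that BRTS improves over standard OPD on challenging reasoning benchmarks, with the largest gains on the harder datasets. The code is available at https://github.com/BWGZK-keke/BRTS.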
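
The priority rule in the abstract (correctness first, student alignment second, with a recovery step when no sampled rollout is correct) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `Rollout` type, the `select_teacher_rollout` function, the `recover` callback, and the use of a student log-probability as the alignment score are all hypothetical names and assumptions made for illustration; the abstract does not say how alignment is measured or how the recovery step is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Rollout:
    """One sampled teacher trajectory (hypothetical container)."""
    text: str
    is_correct: bool        # final answer matches the ground truth
    student_logprob: float  # ASSUMED alignment proxy: mean per-token
                            # log-prob of the trajectory under the student


def select_teacher_rollout(
    pool: Sequence[Rollout],
    recover: Optional[Callable[[], Rollout]] = None,
) -> Optional[Rollout]:
    """Pick an auxiliary teacher trajectory from a small pool.

    Priority: correctness first, student alignment second. If no
    rollout in the pool is correct, fall back to `recover`, a
    hypothetical stand-in for the ground-truth-conditioned recovery
    step. Returns None when nothing usable is available.
    """
    correct = [r for r in pool if r.is_correct]
    if correct:
        # Among correct rollouts, prefer the one the student finds most likely.
        return max(correct, key=lambda r: r.student_logprob)
    if recover is not None:
        return recover()
    return None


pool = [
    Rollout("A", is_correct=False, student_logprob=-0.2),
    Rollout("B", is_correct=True, student_logprob=-1.5),
    Rollout("C", is_correct=True, student_logprob=-0.9),
]
chosen = select_teacher_rollout(pool)
print(chosen.text)  # C
```

Rollout "A" has the highest alignment score but is skipped because it is incorrect, which is the point of ranking correctness ahead of alignment. Between the two correct rollouts, "C" wins on the assumed alignment score.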

Related papers