Fugu-MT Paper Translation (Abstract): Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Paper overview: Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

arxiv url: http://arxiv.org/abs/2606.18216v1
Date: Tue, 16 Jun 2026 17:46:02 GMT
Status: Translation complete
Last updated in system: 2026-06-17 17:15:32.585441
Title: Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
Title (reference translation): Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
Authors: Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan, Karan Sapra, Andrew Tao, Pavlo Molchanov, Yejin Choi, Yu-Chiang Frank Wang, Ryo Hachiuma
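Abstract summary: Zone of Proximal Policy Optimization (ZPPO) is inspired by Vygotsky's zone of proximal development. ZPPO pairs one correct teacher response with one incorrect student response as anonymized candidates that the student must discriminate. A prompt replay buffer recirculates each hard question until the student's mean rollout accuracy on it reaches half or the question is FIFO-evicted.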
Reference score (independently computed attention score): 89.89504265663057
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
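Abstract (reference translation): Knowledge distillation transfers a teacher's competence to a small student, but it is brittle in the small-student regime: making the student imitate the logits of a much larger teacher concentrates it on the teacher's sharpest modes and hurts generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails, so that the advantage is zero and the question is silently discarded, injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. This paper introduces ZPPO, inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until the student's mean rollout accuracy on it reaches half or the question is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.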
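
The replay-buffer rule in the abstract (recirculate a hard question until it graduates or is FIFO-evicted) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the names `PromptReplayBuffer`, `is_hard`, and `group_advantages` are hypothetical, rewards are assumed to be binary (1.0 correct, 0.0 wrong), "reaches half" is assumed to mean a mean accuracy of at least 0.5, and the advantage is simplified to mean-centering without the standard-deviation normalization that GRPO normally applies.

```python
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Mean-centered group advantage (simplified; no std normalization)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def is_hard(rewards: List[float]) -> bool:
    """Assumed definition: a question is hard when every rollout fails."""
    return all(r == 0.0 for r in rewards)


@dataclass
class PromptReplayBuffer:
    """Hypothetical FIFO buffer of hard questions with a graduation rule."""
    capacity: int
    graduate_at: float = 0.5  # assumed reading of "reaches half"
    _items: OrderedDict = field(default_factory=OrderedDict, init=False, repr=False)

    def __post_init__(self) -> None:
        if self.capacity < 1:
            raise ValueError("capacity must be at least 1")

    def add(self, question_id: str, question: str) -> None:
        # A question already in the buffer keeps its original position.
        if question_id in self._items:
            return
        if len(self._items) >= self.capacity:
            self._items.popitem(last=False)  # FIFO eviction: drop the oldest
        self._items[question_id] = question

    def update(self, question_id: str, rewards: List[float]) -> bool:
        """Score fresh rollouts; return True if the question graduated."""
        if question_id not in self._items or not rewards:
            return False
        if sum(rewards) / len(rewards) >= self.graduate_at:
            del self._items[question_id]  # graduated: stop recirculating
            return True
        return False

    def __contains__(self, question_id: str) -> bool:
        return question_id in self._items

    def __len__(self) -> int:
        return len(self._items)
```

When every rollout fails, `group_advantages` returns all zeros, which is why such questions contribute no gradient under plain group-relative RL and are candidates for the buffer instead.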

Related papers