Fugu-MT Paper Translation (Abstract): Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

Paper overview: Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

arxiv url: http://arxiv.org/abs/2605.10781v1
Date: Mon, 11 May 2026 16:16:00 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.971983
Title: Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
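Title (reference translation): Rebellious Student: Reversing Teacher Signals to Explore RLVR Self-Sufficiently
Authors: Jeonghye Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang
Abstract summary: This paper proposes reading the original self-distillation signal in reverse. When the student succeeds along a path the teacher did not predict, these tokens reflect its self-driven reasoning. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success.
Reference score (independently computed attention score): 24.635100877140747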
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts, the same mechanism instead overwrites the student's choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
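Abstract (reference translation): Self-distillation has emerged as a powerful framework for post-training LLMs, in which a teacher conditioned on extra information guides a student from the same model. This guidance helps when the student has failed, but on successful rollouts the same mechanism overwrites the student's choices and suppresses its own reasoning. We therefore propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher did not predict, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher). We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.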
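
Illustrative sketch (not from the paper): the abstract says RLRT augments GRPO by reinforcing, on correct rollouts, the tokens the teacher would not have predicted. The toy Python below shows one way such a signal could be wired up. Everything specific here is an assumption made for illustration: the function names, the weight form `1 + beta * max(0, student_logp - teacher_logp)`, the `beta` parameter, and treating `reward > 0` as "correct". The paper's actual objective may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """GRPO-style advantages: normalize rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def reversed_teacher_weights(student_logps, teacher_logps, correct, beta=0.5):
    """Hypothetical per-token weights.

    On a correct rollout, a token the student assigned more log-probability
    than the teacher did gets extra weight proportional to that gap.
    Incorrect rollouts are left unweighted (plain GRPO).
    """
    if not correct:
        return [1.0] * len(student_logps)
    return [1.0 + beta * max(0.0, s - t)
            for s, t in zip(student_logps, teacher_logps)]


def token_advantages(rewards, student_logps, teacher_logps, beta=0.5):
    """Combine group advantages with the hypothetical reversed-teacher weights."""
    advantages = group_advantages(rewards)
    result = []
    for adv, reward, s_lp, t_lp in zip(advantages, rewards,
                                       student_logps, teacher_logps):
        # Assumption: a verifiable reward above zero marks a correct rollout.
        weights = reversed_teacher_weights(s_lp, t_lp, correct=reward > 0, beta=beta)
        result.append([adv * w for w in weights])
    return result


if __name__ == "__main__":
    rewards = [1.0, 0.0]                       # one correct, one incorrect rollout
    student = [[-0.1, -0.2], [-0.3, -0.4]]     # student token log-probs
    teacher = [[-0.1, -2.2], [-1.0, -1.0]]     # teacher token log-probs
    print(token_advantages(rewards, student, teacher))
```

In this toy setup, only the second token of the correct rollout is amplified, because it is the one the student chose with much higher probability than the teacher. Tokens in the incorrect rollout keep their ordinary group-normalized advantage.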

Related papers