Fugu-MT Paper Translation (Abstract): AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

Paper overview: AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

arxiv url: http://arxiv.org/abs/2605.18529v1
Date: Mon, 18 May 2026 15:14:34 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:49.898116
Title: AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment
Title (reference translation): AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment
Authors: Zhenlin Wei, Pu Jian, Yingzhuo Deng, Xiaohan Wang, Jiajun Chai, Zhexin Hu, Wei Lin, Shanbin Zhang, Guojun Yin
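Abstract summary: Asymmetric Meta-Reflective Self-Distillation (AMR-SD) introduces Causal Information Gain (CIG) with an asymmetric, ReLU-gated threshold to translate self-generated reflections into sparse, highly precise token-level advantage modulations. Experiments across scientific, mathematical, and tool-use benchmarks show that AMR-SD significantly outperforms existing baselines.
Reference score (independently computed attention score): 39.63424981516754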
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The alignment of Large Language Models (LLMs) for complex reasoning heavily relies on Reinforcement Learning with Verifiable Rewards (RLVR). However, standard algorithms like GRPO apply sequence-level rewards uniformly to all tokens, creating a severe credit-assignment bottleneck. While on-policy self-distillation attempts to resolve this by conditioning a self-teacher on privileged contexts, direct exposure to raw oracle solutions often induces over-conditioned teacher distributions, implicit answer leakage, and late-stage training collapse. To overcome these limitations, we propose Asymmetric Meta-Reflective Self-Distillation (AMR-SD). Instead of conditioning directly on raw reference traces, AMR-SD inserts a reflection bottleneck: it compresses diagnostic signals -- from verifier outcomes, peer rollouts, or reference feedback -- into concise, self-generated Socratic hints and critiques. Furthermore, we introduce Causal Information Gain (CIG) with an asymmetric, ReLU-gated threshold to translate these reflections into sparse, highly precise token-level advantage modulations. Combined with temporal annealing, this mechanism preserves the base environmental reward while filtering out distributional noise. Experiments across scientific, mathematical, and tool-use benchmarks demonstrate that AMR-SD significantly outperforms existing baselines, achieving robust long-horizon stability and successfully preventing late-stage collapse.
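Abstract (reference translation): The alignment of Large Language Models (LLMs) for complex reasoning relies heavily on Reinforcement Learning with Verifiable Rewards (RLVR). However, standard algorithms such as GRPO apply sequence-level rewards uniformly to every token, which creates a severe credit-assignment bottleneck. On-policy self-distillation tries to resolve this by conditioning a self-teacher on privileged context, but direct exposure to raw oracle solutions often causes over-conditioned teacher distributions, implicit answer leakage, and late-stage training collapse. To overcome these limitations, we propose Asymmetric Meta-Reflective Self-Distillation (AMR-SD). It compresses diagnostic signals, such as verifier outcomes, peer rollouts, and reference feedback, into concise, self-generated Socratic hints and critiques. In addition, we introduce Causal Information Gain (CIG) with an asymmetric, ReLU-gated threshold, which converts these reflections into sparse, highly precise token-level advantage modulations. Combined with temporal annealing, this mechanism preserves the base environmental reward while filtering out distributional noise. Experiments on scientific, mathematical, and tool-use benchmarks show that AMR-SD significantly outperforms existing baselines, achieves robust long-horizon stability, and prevents late-stage collapse.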
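The abstract does not give formulas, so the following is only a minimal sketch of how a ReLU-gated, annealed token-level advantage modulation could sit on top of a GRPO-style sequence-level advantage. Everything here is an assumption made for illustration and is not taken from the paper: the definition of the gain as a per-token log-probability difference between a reflection-conditioned teacher and the plain student, the two thresholds `tau_pos` and `tau_neg`, the linear annealing schedule, the additive combination, and all function names.

```python
import math


def grpo_advantages(rewards):
    """Group-normalized sequence-level advantages: one scalar per rollout.

    Every token of a rollout would share this scalar, which is the uniform
    credit assignment the abstract describes as a bottleneck.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def information_gain(teacher_logprobs, student_logprobs):
    """ASSUMPTION: per-token gain = teacher log-prob minus student log-prob,
    where the teacher is the same model conditioned on a reflection."""
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


def asymmetric_relu_gate(gains, tau_pos, tau_neg):
    """ASSUMPTION: separate thresholds for positive and negative gains.

    Gains inside [-tau_neg, tau_pos] are zeroed, which makes the signal sparse.
    """
    gated = []
    for g in gains:
        positive_part = max(g - tau_pos, 0.0)
        negative_part = max(-g - tau_neg, 0.0)
        gated.append(positive_part - negative_part)
    return gated


def annealed_weight(step, total_steps, beta0):
    """ASSUMPTION: linear decay of the modulation weight to zero."""
    return beta0 * max(0.0, 1.0 - step / total_steps)


def token_advantages(sequence_advantage, gated_gains, beta):
    """Additive modulation: tokens with zero gated gain keep the base advantage."""
    return [sequence_advantage + beta * g for g in gated_gains]


if __name__ == "__main__":
    seq_adv = grpo_advantages([1.0, 0.0])[0]  # advantage of the rewarded rollout
    gains = information_gain([-0.1, -1.0, -2.0], [-1.0, -1.05, -0.5])
    gated = asymmetric_relu_gate(gains, tau_pos=0.5, tau_neg=1.0)
    beta = annealed_weight(step=0, total_steps=100, beta0=0.5)
    print(token_advantages(seq_adv, gated, beta))
```

In this sketch the middle token, whose gain falls inside the dead zone, keeps the unmodified sequence-level advantage, and the modulation weight reaches zero at the end of the hypothetical schedule so that only the base reward signal remains.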
