FuguReport

Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

Authors: Yanjiang Liu, Jie Lou, Xinyan Guan, Yuqiu Ji, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu
Affiliations: Chinese Academy of Sciences / University of the Chinese Academy of Sciences / Xiaohongshu
Categories: Method / Knowledge Distillation / On-policy student training; Task / Sequence Modeling / Inference ability transfer; Evaluation / Model Fidelity Evaluation / Supervision fidelity decay mitigation
License: CC BY 4.0

Abstract Overview

This paper studies a failure mode in on-policy distillation (OPD) for reasoning models: teacher supervision degrades along student-generated trajectories. The authors call this effect Supervision Fidelity Decay (SFD) and show empirically that, as student prefixes grow longer, the teacher's next-token confidence and downstream completion accuracy both decline. They also analyze reverse-KL distillation theoretically, claiming that when the teacher's distribution is diffuse, the corrective gradient collapses into a weaker, student-driven signal, which compounds drift over long reasoning chains. To mitigate this, they propose Lookahead Group Reward (LGR), which scores candidate tokens by the teacher's confidence at the next step and adds this signal to standard on-policy distillation. To limit the added cost, LGR relies on an entropy-triggered tree-attention mechanism.
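For orientation, the reverse-KL objective referred to above can be written in its usual textbook form. The notation here is an assumption chosen for illustration (student $\pi_\theta$, teacher $\pi_T$, prompt $x$, student-sampled response $y$, vocabulary $V$) and may differ from the paper's own:

```latex
\mathcal{L}_{\mathrm{OPD}}(\theta)
  = \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x, y_{<t}) \,\middle\|\, \pi_T(\cdot \mid x, y_{<t}) \right)
    \right],
\qquad
\mathrm{KL}\!\left( \pi_\theta \,\middle\|\, \pi_T \right)
  = \sum_{v \in V} \pi_\theta(v) \left[ \log \pi_\theta(v) - \log \pi_T(v) \right].
```

One way to read the collapse claim, offered as intuition rather than as the paper's derivation: in the limiting case where $\pi_T$ is uniform at some position, $\log \pi_T(v) = -\log |V|$ for every token, so the per-position term reduces to $\log |V| - H(\pi_\theta)$. Minimizing it then only changes the student's own entropy and carries no information about which token the teacher prefers.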

Novelty

The paper's main novelty is the explicit characterization of supervision fidelity decay as a position-dependent structural problem in on-policy reverse-KL distillation, rather than treating weak performance at long horizons as only an optimization issue. It also introduces a one-step lookahead, group-normalized confidence reward that uses the teacher's future-step confidence to preserve useful supervision under drift.

Results

Across six math and code benchmarks, LGR improves mean@8 over OPD by 1.61 points for a 1.5B student and by 2.57 points for a 7B student. The gains become larger as maximum generation length increases, including a +4.92 mean@8 improvement on AIME-26 at 39k tokens, and training diagnostics show higher teacher log-probability and more stable entropy than OPD.
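The summary does not define mean@8. A minimal sketch, assuming the common convention that mean@k is per-problem accuracy averaged over k sampled responses and then averaged over problems, reported in percentage points; the function name `mean_at_k` is hypothetical:

```python
from typing import List


def mean_at_k(correct: List[List[bool]]) -> float:
    """Average accuracy (in percent) over k samples per problem.

    correct[i][j] is True if the j-th sampled response to problem i is right.
    Assumes every problem has at least one sample.
    """
    # Accuracy of each problem across its k samples.
    per_problem = [sum(samples) / len(samples) for samples in correct]
    # Unweighted mean over problems, scaled to percentage points.
    return 100.0 * sum(per_problem) / len(per_problem)
```

Under this reading, a 1.61-point gain means the averaged per-sample accuracy rose by 1.61 percentage points.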

Key Points

  1. The authors identify Supervision Fidelity Decay: as student-generated prefixes lengthen, teacher confidence and completion quality decrease, weakening the usefulness of reverse-KL supervision.
  2. LGR addresses this by evaluating top-K student token candidates using the teacher's next-step peak probability, normalizing rewards within the candidate group, and activating the extra computation mainly at high-entropy positions.
  3. Experiments indicate that LGR outperforms OPD and other distillation baselines on average, with the strongest improvements appearing on longer reasoning trajectories where supervision decay is most severe.
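The reward described in point 2 can be sketched roughly as follows. This is an illustrative reading of the summary, not the authors' implementation: the function names, the `teacher_next_probs` callable, the default `k`, the entropy threshold, and the mean/standard-deviation normalization are all assumptions, and the tree-attention batching is omitted.

```python
import math
from typing import Callable, Dict, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def lookahead_group_rewards(
    student_probs: Sequence[float],
    teacher_next_probs: Callable[[int], Sequence[float]],
    k: int = 4,
    entropy_threshold: float = 1.0,
    eps: float = 1e-8,
) -> Dict[int, float]:
    """Group-normalized one-step lookahead rewards for top-k student tokens.

    teacher_next_probs(tok) is a hypothetical callable returning the teacher's
    distribution at the step after `tok` is appended to the current prefix.
    """
    # Entropy gate (assumed form): skip the extra teacher calls when the
    # student is already confident at this position.
    if entropy(student_probs) < entropy_threshold:
        return {}

    # Top-k candidate token ids under the student distribution.
    ranked = sorted(range(len(student_probs)),
                    key=lambda i: student_probs[i], reverse=True)
    candidates = ranked[:k]

    # Raw reward: the teacher's peak probability one step ahead.
    raw = [max(teacher_next_probs(tok)) for tok in candidates]

    # Normalize within the candidate group (zero mean, unit scale).
    mean = sum(raw) / len(raw)
    std = math.sqrt(sum((r - mean) ** 2 for r in raw) / len(raw))
    return {tok: (r - mean) / (std + eps) for tok, r in zip(candidates, raw)}
```

With a toy teacher that is confident after token 0 (peak 0.9) and diffuse after token 1 (peak 0.5), calling the function with `k=2` on a high-entropy student distribution gives token 0 a positive reward and token 1 a negative one. At a low-entropy position the function returns an empty dict, which matches the summary's point that the extra computation is activated mainly at high-entropy positions.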

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.