Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation
Abstract Overview
This paper studies a failure mode in on-policy distillation for reasoning models, arguing that teacher supervision degrades along student-generated trajectories. The authors call this effect Supervision Fidelity Decay (SFD) and show empirically that, as student prefixes become longer, both the teacher's next-token confidence and its downstream completion accuracy decline. They further analyze reverse-KL distillation theoretically, claiming that diffuse teacher distributions cause the corrective gradient to collapse into a weaker student-driven signal, which compounds drift over long reasoning chains. To mitigate this, they propose Lookahead Group Reward (LGR), which scores candidate tokens by the teacher's confidence at the next step and adds this signal to standard on-policy distillation (OPD). An entropy-triggered tree-attention mechanism keeps the extra computation efficient.
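One way to see why a diffuse teacher weakens reverse-KL supervision is the limiting case of a uniform teacher, where the loss reduces to log(V) minus the student's entropy and no longer depends on which tokens the student prefers. The sketch below is only an illustration under that assumption; the toy distributions and function names are hypothetical and not taken from the paper.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def reverse_kl(student, teacher):
    """KL(student || teacher) over a shared vocabulary (teacher must be > 0 wherever student is)."""
    return sum(s * (math.log(s) - math.log(t))
               for s, t in zip(student, teacher) if s > 0)

# Toy 4-token vocabulary (hypothetical values).
student = [0.5, 0.3, 0.1, 0.1]
student_permuted = [0.1, 0.1, 0.3, 0.5]   # same entropy, different preferred token
peaked_teacher = [0.9, 0.05, 0.03, 0.02]  # confident teacher
uniform_teacher = [0.25] * 4              # maximally diffuse teacher

# A peaked teacher distinguishes the two students ...
kl_peaked = reverse_kl(student, peaked_teacher)
kl_peaked_permuted = reverse_kl(student_permuted, peaked_teacher)

# ... while a uniform teacher does not: the loss is log(V) - H(student),
# so the training signal is driven by the student's own entropy alone.
kl_uniform = reverse_kl(student, uniform_teacher)
kl_uniform_permuted = reverse_kl(student_permuted, uniform_teacher)
```

Here `kl_peaked` and `kl_peaked_permuted` differ substantially, whereas `kl_uniform` and `kl_uniform_permuted` coincide, which mirrors the claimed collapse into a student-driven signal.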
Novelty
The paper's main novelty is the explicit characterization of supervision fidelity decay as a position-dependent structural problem in on-policy reverse-KL distillation, rather than treating weak performance at long horizons as only an optimization issue. It also introduces a one-step lookahead, group-normalized confidence reward that uses the teacher's future-step confidence to preserve useful supervision under drift.
Results
Across six math and code benchmarks, LGR improves mean@8 over OPD by 1.61 points for a 1.5B student and by 2.57 points for a 7B student. The gains become larger as maximum generation length increases, including a +4.92 mean@8 improvement on AIME-26 at 39k tokens, and training diagnostics show higher teacher log-probability and more stable entropy than OPD.
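For readers unfamiliar with the metric, the sketch below assumes mean@8 denotes the fraction of correct answers among 8 sampled completions per problem, averaged over problems and reported in percentage points. This reading is an assumption rather than a definition quoted from the paper, and the helper name is hypothetical.

```python
def mean_at_k(results, k=8):
    """Average per-problem accuracy over k samples, in percentage points.

    results: list of per-problem lists, each holding k booleans
             (True if that sampled completion was judged correct).
    """
    per_problem = []
    for samples in results:
        if len(samples) != k:
            raise ValueError(f"expected {k} samples per problem, got {len(samples)}")
        per_problem.append(sum(samples) / k)
    return 100.0 * sum(per_problem) / len(per_problem)

# Two toy problems: one solved in 4 of 8 samples, one solved in all 8.
toy_results = [[True] * 4 + [False] * 4, [True] * 8]
score = mean_at_k(toy_results)
```

Under this reading, a reported gain of 1.61 points would correspond to a difference of 1.61 in `score` between two models evaluated on the same problems.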
Key Points
- The authors identify Supervision Fidelity Decay: as student-generated prefixes lengthen, teacher confidence and completion quality decrease, weakening the usefulness of reverse-KL supervision.
- LGR addresses this by evaluating top-K student token candidates using the teacher's next-step peak probability, normalizing rewards within the candidate group, and activating the extra computation mainly at high-entropy positions.
- Experiments indicate that LGR outperforms OPD and other distillation baselines on average, with the strongest improvements appearing on longer reasoning trajectories where supervision decay is most severe.
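The LGR steps described above (entropy trigger, top-K student candidates, teacher next-step peak probability, within-group normalization) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `teacher_next_probs` is a hypothetical callable, the z-score normalization and the threshold value are assumptions, and the tree-attention batching and the way the reward is combined with the distillation loss are not modeled.

```python
import math
from statistics import mean, pstdev

def entropy(probs):
    """Shannon entropy (in nats) of an iterable of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def lookahead_group_rewards(prefix, student_probs, teacher_next_probs,
                            k=4, entropy_threshold=1.0, eps=1e-6):
    """Return {candidate_token: normalized reward}, or {} at low-entropy positions.

    prefix:             list of token ids generated so far
    student_probs:      dict token -> student next-token probability
    teacher_next_probs: hypothetical callable; maps a token list to a dict
                        token -> teacher next-token probability
    """
    # Entropy trigger: only spend the extra teacher calls where the student is uncertain.
    if entropy(student_probs.values()) < entropy_threshold:
        return {}
    # Top-K candidates under the student distribution.
    candidates = sorted(student_probs, key=student_probs.get, reverse=True)[:k]
    # Raw reward: the teacher's peak probability one step after each candidate.
    raw = {tok: max(teacher_next_probs(prefix + [tok]).values()) for tok in candidates}
    # Group normalization (assumed here to be a z-score within the candidate group).
    values = list(raw.values())
    mu, sigma = mean(values), pstdev(values)
    return {tok: (r - mu) / (sigma + eps) for tok, r in raw.items()}

def toy_teacher(tokens):
    """Toy stand-in for a teacher: confident after token 7, diffuse otherwise."""
    if tokens[-1] == 7:
        return {0: 0.9, 1: 0.1}
    return {0: 0.5, 1: 0.5}

uncertain_student = {7: 0.4, 8: 0.35, 9: 0.25}
rewards = lookahead_group_rewards([1, 2], uncertain_student, toy_teacher,
                                  k=3, entropy_threshold=0.5)

confident_student = {7: 0.99, 8: 0.01}
skipped = lookahead_group_rewards([1, 2], confident_student, toy_teacher,
                                  k=2, entropy_threshold=0.5)
```

In the toy run, token 7 receives a positive reward because the teacher is confident after it, the other two candidates receive equal negative rewards, and the confident-student position is skipped entirely.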