FuguReport

Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

Authors: Zili Wang, Jiajun Chai, Lin Chen, Xiaohan Wang, Shiming Xiang, Guojun Yin
Affiliations: Meituan / Chinese Academy of Sciences / University of the Chinese Academy of Sciences
Categories: Method / Reinforcement Learning / Multi-token prediction training; Method / Optimization / Coefficient calibration via log-prob proxy; Application / Sequential Prediction / Online adaptation in RL training
License: CC BY 4.0

Abstract Overview

This paper studies why jointly training multi-token prediction (MTP) with reinforcement learning (RL) often hurts performance during post-training, even though MTP is useful in pretraining. The authors analyze the per-step effect of MTP on the RL objective and decompose it into a first-order gradient-correlation term and a second-order perturbation penalty term. They use this framework to explain the behavior of three regimes (Detach, cross-entropy MTP loss, and policy-loss-based joint training) and to argue that a fixed MTP weight fails because gradient alignment changes over the course of training. Based on this analysis, they propose Optimal Coefficient Calibration (OCC), which sets the MTP coefficient adaptively online using a log-probability proxy instead of expensive full-gradient computation.
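
The two-term decomposition can be illustrated with a standard smoothness argument. This is a sketch under assumptions that are not taken from the paper: the RL objective $J$ is $L$-smooth, the shared parameters are updated by $\theta' = \theta + \eta\,(g_{\mathrm{RL}} + \lambda\, g_{\mathrm{MTP}})$ with $\eta L < 1$, and the symbols $g_{\mathrm{RL}}$, $g_{\mathrm{MTP}}$, $\eta$, $L$, and $\lambda$ are illustrative names. The paper's own statement may differ in form and constants.

```latex
\begin{align*}
% L-smoothness lower bound on the RL objective after one update
J(\theta') &\ge J(\theta)
  + \eta\, g_{\mathrm{RL}}^{\top}\bigl(g_{\mathrm{RL}} + \lambda\, g_{\mathrm{MTP}}\bigr)
  - \frac{L\eta^{2}}{2}\,\bigl\lVert g_{\mathrm{RL}} + \lambda\, g_{\mathrm{MTP}} \bigr\rVert^{2} \\[4pt]
% Part of the bound that depends on lambda (the lambda = 0 bound subtracted)
\Delta(\lambda) &=
  \underbrace{\lambda\,\eta\,(1 - L\eta)\; g_{\mathrm{RL}}^{\top} g_{\mathrm{MTP}}}_{\text{first-order correlation term}}
  \;-\;
  \underbrace{\frac{L\eta^{2}}{2}\,\lambda^{2}\,\lVert g_{\mathrm{MTP}} \rVert^{2}}_{\text{second-order penalty term}} \\[4pt]
% Maximizer of the concave quadratic, clipped at zero
\lambda^{\star} &= \max\!\left(0,\;
  \frac{(1 - L\eta)\; g_{\mathrm{RL}}^{\top} g_{\mathrm{MTP}}}{L\,\eta\,\lVert g_{\mathrm{MTP}} \rVert^{2}}\right)
\end{align*}
```

Under these assumptions, the preferred coefficient scales with the alignment $g_{\mathrm{RL}}^{\top} g_{\mathrm{MTP}}$ and drops to zero when the two gradients are uncorrelated or opposed. Because that alignment moves during training, any fixed $\lambda$ is eventually mismatched, which is consistent with the argument summarized above.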

Novelty

The main novelty is a theoretical optimization-based account of joint MTP-RL training that unifies common training regimes under a single decomposition and identifies a phase transition from correlation-dominant to penalty-dominant behavior. The paper also introduces OCC, an adaptive coefficient calibration method that tracks the theoretically preferred weighting online through a cheap log-probability proxy.

Results

Across six mathematical reasoning benchmarks, OCC consistently matches or outperforms the Detach baseline, while cross-entropy joint training consistently underperforms and fixed-coefficient policy loss shows a rise-then-fall pattern. The reported results also indicate generalization across RL algorithms (DAPO and GSPO) and across base models, and the adaptive proxy-based method adds negligible training-time overhead relative to Detach.

Key Points

  1. The paper decomposes MTP's effect on RL into a beneficial gradient-alignment term and a harmful quadratic perturbation term, providing a unified explanation for Detach, CE loss, and policy loss regimes.
  2. The proposed OCC method adaptively calibrates the MTP coefficient from online log-probability-change proxies, avoiding full-model gradient computation while tracking training dynamics.
  3. Empirically, OCC is reported to be more stable than fixed-coefficient joint training, to outperform CE-based joint training, and to meet or exceed Detach across multiple reasoning benchmarks, RL algorithms, and model scales.
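
The adaptive weighting described in the key points can be sketched in a few lines of Python. This is not the paper's OCC implementation: it assumes the per-step effect of the MTP coefficient is modeled as `a * lam - 0.5 * b * lam**2`, so the maximizer is `a / b`, and it assumes the caller supplies two scalar proxies per step (`corr_proxy` for the first-order term and `penalty_proxy` for the second-order term). How those proxies are derived from log-probability changes is not specified in this summary, and the class name, fields, and defaults below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CoefficientCalibrator:
    """Hypothetical online calibrator for an auxiliary-loss coefficient.

    Assumes a quadratic per-step model a*lam - 0.5*b*lam**2, whose
    maximizer is a / b. Both a and b are tracked with exponential
    moving averages of caller-supplied scalar proxies.
    """

    lam_max: float = 1.0   # upper clip for the coefficient (assumed)
    decay: float = 0.9     # EMA decay for both proxies (assumed)
    eps: float = 1e-8      # guards against division by zero
    corr_ema: float = 0.0
    penalty_ema: float = 0.0

    def update(self, corr_proxy: float, penalty_proxy: float) -> float:
        """Fold in one step's proxies and return the clipped coefficient."""
        self.corr_ema = self.decay * self.corr_ema + (1.0 - self.decay) * corr_proxy
        self.penalty_ema = self.decay * self.penalty_ema + (1.0 - self.decay) * penalty_proxy
        # Treat a non-positive penalty estimate as zero curvature.
        denom = max(self.penalty_ema, 0.0) + self.eps
        lam = self.corr_ema / denom
        # Negative alignment maps to 0; large ratios saturate at lam_max.
        return min(self.lam_max, max(0.0, lam))
```

In a training loop, the returned value would scale the MTP loss for the next step. Both moving averages start at zero with the same decay, so their startup bias cancels in the ratio and no separate bias correction is applied.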

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.