Fugu-MT Paper Translation (Abstract): On-Policy Replay for Continual Supervised Fine-Tuning

Paper overview: On-Policy Replay for Continual Supervised Fine-Tuning

arxiv url: http://arxiv.org/abs/2605.29495v1
Date: Thu, 28 May 2026 07:19:47 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:55.939733
Title: On-Policy Replay for Continual Supervised Fine-Tuning
Title (reference translation): On-Policy Replay for Continual Supervised Fine-Tuning
Authors: Yan Chen, Taojie Zhu, Meng Zhang, Xin Chen, Jiaqi Huang, Dongyang Xu, Yizhi Wang,
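Abstract summary: On-policy signals reduce forgetting more reliably than off-policy supervision. The proposed method, On-Policy Replay (OPR), rolls out the most recent checkpoint on a small budget of historical prompts. On the sharpest stress test, OPR lifts BWT to -0.65 at a 10% replay budget and to -2.29 at a 1% budget.
Reference score (independently computed attention score): 22.944606442798147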
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Continual supervised fine-tuning (SFT) is the de facto recipe for adapting large language models (LLMs) to a stream of downstream tasks, but it suffers from catastrophic forgetting of earlier capabilities. Recent work shows that on-policy signals -- training on the model's own outputs -- reduce forgetting more reliably than off-policy supervision. Existing on-policy methods route this signal through a new training objective (e.g., self-distillation losses with a teacher copy), inheriting an extra forward pass, schedule sensitivity, and stylistic drift from the teacher. We instead route the on-policy signal through the training data source. Our method, On-Policy Replay (OPR), rolls out the most recent checkpoint on a small budget of historical prompts, filters the generations by a task reward, and replays the surviving (prompt, model response) pairs as ordinary SFT examples. There is no teacher, no auxiliary loss, and no on-the-fly distillation. Across three 7--8B instruction-tuned backbones (Qwen2.5-7B-Instruct, Qwen3-8B, Llama3.1-8B-Instruct) on the TRACE continual-learning benchmark, OPR consistently reduces forgetting; on the sharpest stress test (Qwen2.5-7B-Instruct, Sequential SFT BWT -13.93), OPR lifts BWT to -0.65 at a 10% replay budget and to -2.29 at a 1% budget -- a 46% reduction in |BWT| over a tuned Vanilla Replay baseline, with 42--46% reductions observed across all three backbones. We give a KL-shrinkage interpretation that places OPR and prior on-policy distillation methods on a single axis, and we present a counterintuitive finding that explains why Vanilla Replay is already a strong baseline: low-score replay is uniformly worse than Vanilla Replay, demonstrating that the active ingredient in OPR is the on-policy distribution, not the response quality alone. Our code is available at https://github.com/Yancey2024/OnPolicyReplay.
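Abstract (reference translation): Continual supervised fine-tuning (SFT) is the de facto recipe for adapting large language models (LLMs) to a stream of downstream tasks. Recent work shows that on-policy signals -- training on the model's own outputs -- reduce forgetting more reliably than off-policy supervision. Existing on-policy methods route this signal through a new training objective (e.g., self-distillation losses with a teacher copy), inheriting an extra forward pass, schedule sensitivity, and stylistic drift from the teacher. The proposed method, On-Policy Replay (OPR), rolls out the most recent checkpoint on a small budget of historical prompts, filters the generations by a task reward, and replays the surviving (prompt, model response) pairs as ordinary SFT examples. There is no teacher, no auxiliary loss, and no on-the-fly distillation. With 7--8B instruction-tuned backbones (Qwen2.5-7B-Instruct, Qwen3-8B, Llama3.1-8B-Instruct) on the TRACE continual-learning benchmark, on the sharpest stress test (Qwen2.5-7B-Instruct, Sequential SFT BWT -13.93), OPR lifts BWT to -0.65 at a 10% replay budget and to -2.29 at a 1% budget. We give a KL-shrinkage interpretation that places OPR and prior on-policy distillation methods on a single axis, and explain why Vanilla Replay is already a strong baseline. Low-score replay is uniformly worse than Vanilla Replay, showing that the active ingredient in OPR is the on-policy distribution, not response quality alone. Our code is available at https://github.com/Yancey2024/OnPolicyReplay.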
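
The replay construction described in the abstract (sample a budget of historical prompts, roll out the latest checkpoint, filter by a task reward, keep the surviving pairs as SFT examples) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `generate` and `reward` callables, the `min_reward` threshold, and both function names are hypothetical placeholders. The `backward_transfer` helper uses the standard continual-learning definition of BWT (mean change in performance on earlier tasks after training on the final one), which the abstract does not spell out and which is assumed here.

```python
import random
from typing import Callable, List, Sequence, Tuple


def build_on_policy_replay(
    historical_prompts: Sequence[str],
    generate: Callable[[str], str],        # hypothetical: rollout from the latest checkpoint
    reward: Callable[[str, str], float],   # hypothetical: task reward for (prompt, response)
    budget: float = 0.10,                  # fraction of historical prompts to roll out
    min_reward: float = 1.0,               # assumed filter threshold, not from the paper
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return (prompt, model response) pairs that pass the reward filter."""
    prompts = list(historical_prompts)
    if not prompts:
        return []
    rng = random.Random(seed)
    # Sample the replay budget without replacement (at least one prompt).
    k = min(len(prompts), max(1, int(len(prompts) * budget)))
    sampled = rng.sample(prompts, k)

    kept = []
    for prompt in sampled:
        response = generate(prompt)  # on-policy: the model's own output
        if reward(prompt, response) >= min_reward:
            kept.append((prompt, response))
    return kept


def backward_transfer(acc: Sequence[Sequence[float]]) -> float:
    """Standard BWT: acc[t][i] is the score on task i after training on task t."""
    n_tasks = len(acc)
    if n_tasks < 2:
        raise ValueError("BWT needs at least two tasks")
    last = acc[n_tasks - 1]
    return sum(last[i] - acc[i][i] for i in range(n_tasks - 1)) / (n_tasks - 1)
```

In such a setup, the returned pairs would simply be mixed into the next task's training data as ordinary SFT examples, with no extra loss term; a more negative BWT indicates more forgetting.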
