Fugu-MT Paper Translation (Abstract): Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

Paper overview: Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

arxiv url: http://arxiv.org/abs/2604.07941v1
Date: Thu, 09 Apr 2026 08:00:37 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:05.787163
Title: Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
Authors: Shiwan Zhao, Zhihu Wang, Xuyang Zhao, Jiaming Zhou, Caiyue Xu, Chenfei Liu, Liting Zhang, Yuhang Jia, Yanzhe Zhang, Hualong Yu, Zichen Xu, Qicheng Li, Yong Qin
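Abstract summary: Post-training has become central to turning pretrained large language models into aligned and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. This survey argues that LLM post-training is best understood as structured intervention on model behavior.
Reference score (independently computed attention score): 37.29007534251622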
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Post-training has become central to turning pretrained large language models (LLMs) into aligned and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objective families rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary learning regimes: off-policy learning on externally supplied trajectories, and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles -- effective support expansion, which makes useful behaviors more reachable, and policy reshaping, which improves behavior within already reachable regions -- together with a complementary systems-level role, behavioral consolidation, which preserves, transfers, and amortizes behavior across stages and model transitions. This perspective yields a unified reading of major paradigms. SFT may serve either support expansion or policy reshaping, whereas preference-based methods are usually off-policy reshaping. On-policy RL often improves behavior on learner-generated states, though under stronger guidance it can also make hard-to-reach reasoning paths reachable. Distillation is often best understood as consolidation rather than only compression, and hybrid pipelines emerge as coordinated multi-stage compositions. Overall, the framework helps diagnose post-training bottlenecks and reason about stage composition, suggesting that progress in LLM post-training increasingly depends on coordinated system design rather than any single dominant objective.

Related papers