Fugu-MT Paper Translation (Abstract): Post-Training with Policy Gradients: Optimality and the Base Model Barrier

Paper overview: Post-Training with Policy Gradients: Optimality and the Base Model Barrier

arxiv url: http://arxiv.org/abs/2603.06957v1
Date: Sat, 07 Mar 2026 00:25:53 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:13.496301
Title: Post-Training with Policy Gradients: Optimality and the Base Model Barrier
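Title (reference translation): Post-Training with Policy Gradients: Optimality and the Base Model Barrier
Authors: Alireza Mousavi-Hosseini, Murat A. Erdogdu
Abstract summary: We study post-training of linear autoregressive models with outcome and process rewards. We prove that a variant of policy gradient (PG) can achieve likelihood $1 - \varepsilon$ with an essentially minimax optimal number of reward queries.
Reference score (independently computed attention measure): 27.674563695368665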
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study post-training linear autoregressive models with outcome and process rewards. Given a context $\boldsymbol{x}$, the model must predict the response $\boldsymbol{y} \in Y^N$, a sequence of length $N$ that satisfies a $\gamma$ margin condition, an extension of the standard separability to sequences. We prove that on test samples where the base model achieves a non-trivial likelihood $\alpha$, a variant of policy gradient (PG) can achieve likelihood $1 - \varepsilon$ with an essentially minimax optimal number of reward queries $\tilde{O}((\alpha^{-1} + \varepsilon^{-1})/\gamma^2)$. However, a barrier arises for going beyond the support of the base model. We prove that the overall expected error after post-training with outcome rewards is governed by a property of the base model called the Likelihood Quantile (LQ), and that variants of PG, while minimax optimal, may require a number of reward queries exponential in $N$ to go beyond this support, regardless of the pre-training algorithm. To overcome this barrier, we study post-training with a process reward model, and demonstrate how PG variants in this setting avoid the curse of dimensionality in $N$ via dependence on a token-level LQ. Along the way, we prove that under the margin condition, SGD with adaptive learning rate (LR) achieves a near optimal test error for statistical learning, and PG with adaptive LR achieves a near optimal number of mistakes for online learning while being computationally efficient whenever possible, both of which may be of independent interest.
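Abstract (reference translation): We study post-training of linear autoregressive models with outcome and process rewards. Given a context $\boldsymbol{x}$, the model must predict the response $\boldsymbol{y} \in Y^N$, a sequence of length $N$ that satisfies a $\gamma$ margin condition. We prove that on test samples where the base model achieves a non-trivial likelihood $\alpha$, a variant of policy gradient (PG) can achieve likelihood $1 - \varepsilon$ with an essentially minimax optimal number of reward queries $\tilde{O}((\alpha^{-1} + \varepsilon^{-1})/\gamma^2)$. However, a barrier arises for going beyond the support of the base model. We prove that the overall expected error after post-training with outcome rewards is governed by a property of the base model called the Likelihood Quantile (LQ), and that variants of PG, while minimax optimal, may require a number of reward queries exponential in $N$ to go beyond this support, regardless of the pre-training algorithm. To overcome this barrier, we study post-training with a process reward model and demonstrate how PG variants in this setting avoid the curse of dimensionality in $N$ via dependence on a token-level LQ. Along the way, we prove that, under the margin condition, SGD with adaptive learning rate (LR) achieves a near optimal test error for statistical learning, and that PG with adaptive LR achieves a near optimal number of mistakes for online learning while being computationally efficient whenever possible.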
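
A rough illustration of how the outcome-reward query bound in the abstract can grow exponentially in $N$. This is a sketch under a hypothetical assumption, not the paper's argument: suppose the base model assigns conditional probability $p \in (0,1)$ to the correct token at every one of the $N$ positions (the symbol $p$ does not appear in the abstract). The sequence-level likelihood is then the product of these conditionals, and substituting it into the stated bound gives:

```latex
% Assumption (hypothetical): each correct token has conditional probability p.
\alpha = \prod_{i=1}^{N} p = p^{N}
\quad\Longrightarrow\quad
\tilde{O}\!\left(\frac{\alpha^{-1} + \varepsilon^{-1}}{\gamma^{2}}\right)
= \tilde{O}\!\left(\frac{p^{-N} + \varepsilon^{-1}}{\gamma^{2}}\right).
% Example: p = 1/2 and N = 20 give \alpha^{-1} = 2^{20} \approx 10^{6}.
```

Under this assumption the $\alpha^{-1}$ term dominates as $N$ grows, while the per-token quantity $p$ does not depend on $N$. This is at least consistent with the abstract's claim that process rewards help through dependence on a token-level LQ.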

Related papers