Fugu-MT Paper Translation (Abstract): How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Paper abstract: How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

arxiv url: http://arxiv.org/abs/2604.25907v1
Date: Tue, 28 Apr 2026 17:52:38 GMT
Status: Translation complete
Last updated in system: 2026-04-29 16:49:17.986827
Title: How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
Title (reference translation): How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
Authors: Chu-Cheng Lin, Eugene Ie
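Abstract summary: A loss family $J_Q$ interpolates between RLVR and the log-marginal-likelihood. Two Monte Carlo estimators are derived from the two factorizations of the gradient.
Reference score (independently computed attention score): 3.9929570259734604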
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ that interpolates between RLVR (at $q{=}0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification $P_θ^{-q}$ that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires $Ω(\frac{1}{p_0})$ time to escape cold start, while the density-estimation pole escapes in $Θ\big(\log(\frac{1}{p_0})\big)$; intermediate $q$ trades escape speed against noise memorization. Because $P_θ$ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias $O\big(\frac{q}{M P_θ^{q+1}}\big)$; GARL has lower variance, PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at $q{=}0.75$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at $q{=}0.75$ provides stable gradients (best overall on HotPotQA at 47.9 maj@16, $+14.4$ over GRPO).
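One way to read the interpolation, as a sketch under assumptions not stated in the abstract: take the standard Tsallis $q$-logarithm and assume the per-example loss is its negative evaluated at the success probability $P_θ$. The symbol $J_q$ and this exact form are illustrative; the paper's definition of $J_Q$ may differ by constants or conventions.

$$\log_q(x) = \frac{x^{1-q} - 1}{1-q} \quad (q \neq 1), \qquad \log_1(x) = \log x$$

$$J_q(θ) = -\log_q P_θ, \qquad \nabla_θ J_q = -P_θ^{-q}\,\nabla_θ P_θ = -P_θ^{1-q}\,\nabla_θ \log P_θ$$

Under this assumption, $q{=}0$ gives $J_0 = 1 - P_θ$ (the expected failure rate, as in RLVR up to a constant) and $q{=}1$ gives $J_1 = -\log P_θ$ (the negative log-marginal-likelihood). The first form of the gradient is an RL-style gradient amplified by $P_θ^{-q}$; the second is a likelihood-style gradient attenuated by $P_θ^{1-q}$. These are presumably the two factorizations behind GARL and PAFT, though the abstract does not spell them out.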
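Abstract (reference translation): When the initial success probability $p_0$ is small, adapting a reasoning model to a new task with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR). Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ that interpolates between RLVR (at $q{=}0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the density-estimation pole). All members share the same per-example gradient direction and differ only by a scalar amplification $P_θ^{-q}$ that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires $Ω(\frac{1}{p_0})$ time to escape cold start, while the density-estimation pole escapes in $Θ\big(\log(\frac{1}{p_0})\big)$; intermediate $q$ trades escape speed against noise memorization. Because $P_θ$ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient. Both have bias $O\big(\frac{q}{M P_θ^{q+1}}\big)$; GARL has lower variance, and PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at $q{=}0.75$ substantially mitigates cold-start stalling where GRPO fails entirely. On HotPotQA and MuSiQue, GARL destabilizes during training, while PAFT at $q{=}0.75$ provides stable gradients (47.9 maj@16 on HotPotQA, $+14.4$ over GRPO).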
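The escape-time claim can be illustrated with a toy simulation. The sketch below is not the paper's setup: it assumes a hypothetical one-parameter model with success probability $p = σ(θ)$, and assumes gradient flow in which the plain RL gradient $dp/dθ = p(1-p)$ is multiplied by the amplification $p^{-q}$. The function names, the Euler step size, and the escape threshold of 0.5 are all illustrative choices.

```python
import math


def sigmoid(theta: float) -> float:
    """Logistic function; maps the toy parameter to a success probability."""
    return 1.0 / (1.0 + math.exp(-theta))


def escape_time(q: float, p0: float, target: float = 0.5,
                dt: float = 1e-2, max_steps: int = 10_000_000) -> float:
    """Time for the toy gradient flow to raise p from p0 to `target`.

    Assumed dynamics (toy model, not taken from the paper):
        d(theta)/dt = p^(-q) * dp/d(theta) = p^(1-q) * (1 - p)
    integrated with forward Euler steps of size dt.
    """
    theta = math.log(p0 / (1.0 - p0))  # logit of the initial success probability
    t = 0.0
    for _ in range(max_steps):
        p = sigmoid(theta)
        if p >= target:
            return t
        theta += dt * p ** (1.0 - q) * (1.0 - p)
        t += dt
    raise RuntimeError("did not reach target within max_steps")


if __name__ == "__main__":
    for p0 in (1e-1, 1e-2, 1e-3):
        times = {q: round(escape_time(q, p0), 2) for q in (0.0, 0.5, 0.75, 1.0)}
        print(f"p0={p0:g}: {times}")
```

In this toy model the $q{=}0$ escape time grows roughly like $1/p_0$ while the $q{=}1$ escape time grows roughly like $\log(1/p_0)$, matching the scaling stated in the abstract. The toy has no label noise, so it says nothing about the noise-memorization side of the trade-off.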


List of related papers