Fugu-MT Paper Translation (Abstract): How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Paper summary: How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

arxiv url: http://arxiv.org/abs/2604.25907v2
Date: Thu, 07 May 2026 17:16:40 GMT
Status: Translation complete
System update date: 2026-05-08 22:27:11.258853
Title: How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
Title (reference translation): How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
Authors: Chu-Cheng Lin, Eugene Ie
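Abstract summary: We present a unified loss family $J_Q$ that interpolates between RLVR and the log-marginal-likelihood. All members share the same per-example gradient direction, differing only by a per-instance amplification $P_θ^{-q}$ that reweights each instance independently of the learning rate. We derive two Monte Carlo estimators that directly optimize a fixed $q$ on the $J_Q$ continuum, without annotated rationales.
Reference score (independently computed attention score): 3.9929570259734604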
License: http://creativecommons.org/licenses/by/4.0/
Abstract: SFT-then-RLVR is widely used for post-training reasoning models, but why this specific ordering, and why RLVR-only stalls at cold start, have lacked a unifying theoretical account. We provide that account under a unified loss family $J_Q$ using the Tsallis $q$-logarithm. $J_Q$ is a single-parameter family that interpolates between RLVR (at $q{=}0$, the \textit{exploitation pole}) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the \textit{density-estimation pole}), under which the standard pipeline corresponds to a stepwise $q{=}1 \to 0$ schedule. All members share the same per-example gradient direction, differing only by a per-instance amplification $P_θ^{-q}$ that reweights each instance independently of the learning rate. Under gradient flow analysis, we show that the exploitation pole requires $Ω(\frac{1}{p_0})$ time to escape cold start but is robust to label noise, while the density-estimation pole escapes in $Θ\big(\log(\frac{1}{p_0})\big)$ but memorizes label noise. This separation explains how SFT ($q{=}1$) first moves the model out of the cold-start regime, followed by the more robust RLVR ($q{=}0$), under the SFT-then-RLVR paradigm. We further derive two Monte Carlo estimators that directly optimize fixed-$q$ on the $J_Q$ continuum, without annotated rationales: Gradient-Amplified RL (GARL) and Posterior-Attenuated Fine-Tuning (PAFT), with shared bias $O\big(\frac{q}{M P_θ^q}\big)$ but different variance and stability properties. On FinQA, HotPotQA, and MuSiQue, GARL at sufficiently high $q$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes and PAFT at $q{=}0.75$ remains stable, reaching $47.9$ \texttt{m@16} on HotPotQA ($+13.9$ over GRPO).
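Abstract (reference translation): SFT-then-RLVR is widely used for post-training reasoning models, but a unified theoretical account of why this particular ordering works, and why RLVR alone stalls at cold start, has been lacking. We provide that account with a unified loss family $J_Q$ built on the Tsallis $q$-logarithm. $J_Q$ is a single-parameter family that interpolates between RLVR (at $q{=}0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the density-estimation pole); under this view the standard pipeline corresponds to a stepwise $q{=}1 \to 0$ schedule. All members share the same per-example gradient direction and differ only by a per-instance amplification $P_θ^{-q}$ that reweights each instance independently of the learning rate. Under gradient flow analysis, the exploitation pole requires $Ω(\frac{1}{p_0})$ time to escape cold start but is robust to label noise, while the density-estimation pole escapes in $Θ\big(\log(\frac{1}{p_0})\big)$ but memorizes label noise. This separation explains how, under the SFT-then-RLVR paradigm, SFT ($q{=}1$) first moves the model out of the cold-start regime and is then followed by the more robust RLVR ($q{=}0$). We further derive two Monte Carlo estimators that directly optimize a fixed $q$ on the $J_Q$ continuum without annotated rationales: Gradient-Amplified RL (GARL) and Posterior-Attenuated Fine-Tuning (PAFT), which share the bias $O\big(\frac{q}{M P_θ^q}\big)$ but differ in variance and stability. On FinQA, HotPotQA, and MuSiQue, GARL at sufficiently high $q$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL destabilizes while PAFT at $q{=}0.75$ remains stable, reaching $47.9$ m@16 on HotPotQA ($+13.9$ over GRPO).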
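The abstract's statement that all members of the family share a gradient direction and differ only by the factor $P_θ^{-q}$ follows from the derivative of the Tsallis $q$-logarithm. The sketch below is illustrative only and is not the paper's code. It assumes the standard definition $\log_q(x) = (x^{1-q} - 1)/(1-q)$, with $\log_1 = \ln$, and the function names `q_log`, `amplification`, and `escape_time` are hypothetical. The last function uses a toy one-parameter model, $P = \mathrm{sigmoid}(θ)$, chosen here purely to illustrate the cold-start separation; it is not the setting analyzed in the paper.

```python
import math


def q_log(x: float, q: float) -> float:
    """Tsallis q-logarithm (standard definition, assumed here).

    q = 0 gives x - 1 (expected reward up to a constant, the RLVR pole);
    q = 1 gives ln(x) (the log-likelihood pole).
    """
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def amplification(p: float, q: float) -> float:
    """d/dp log_q(p) = p**(-q): the per-instance factor multiplying grad P."""
    return p ** (-q)


def escape_time(p0: float, q: float, target: float = 0.5, dt: float = 1e-2) -> float:
    """Toy gradient flow (an assumption for illustration, not the paper's model).

    A single logit theta with P = sigmoid(theta), following gradient ascent
    on log_q(P):
        dtheta/dt = P**(-q) * dP/dtheta = P**(1 - q) * (1 - P)
    Integrated with forward Euler; returns the time at which P reaches `target`.
    """
    theta = math.log(p0 / (1.0 - p0))  # logit of the initial success probability
    t = 0.0
    while True:
        p = 1.0 / (1.0 + math.exp(-theta))
        if p >= target:
            return t
        theta += dt * p ** (1.0 - q) * (1.0 - p)
        t += dt
```

For example, comparing `escape_time(1e-2, q=0.0)` with `escape_time(1e-2, q=1.0)` shows the $q{=}0$ flow taking much longer to leave the low-probability regime in this toy model, in line with the $Ω(\frac{1}{p_0})$ versus $Θ\big(\log(\frac{1}{p_0})\big)$ scaling stated in the abstract.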

Related papers