Fugu-MT Paper Translation (Abstract): PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost

Paper abstract: PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost

arxiv url: http://arxiv.org/abs/2603.21383v1
Date: Sun, 22 Mar 2026 19:59:48 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.391223
Title: PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost
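Title (reference translation): PivotRL: High-Accuracy Agentic Post-Training at Low Cost
Authors: Junkeun Yi, Damon Mosk-Aoyama, Baihe Huang, Ritu Gala, Charles Wang, Sugam Dipak Devare, Khushi Bhardwaj, Abhibha Gupta, Oleksii Kuchaiev, Jiantao Jiao, Jian Zhang, Venkat Srinivasan
Abstract summary: Post-training for long-horizon agentic tasks involves a tension between compute efficiency and generalization. This paper introduces PivotRL, a novel framework that operates on existing SFT trajectories and combines the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL is adopted by NVIDIA's Nemotron-3-Super-120B-A12B, where it acts as the workhorse in production-scale agentic post-training.
Reference score (independently computed attention measure): 22.906887375657664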
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Post-training for long-horizon agentic tasks has a tension between compute efficiency and generalization. While supervised fine-tuning (SFT) is compute efficient, it often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities, but incurs high compute costs due to many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms: first, it executes local, on-policy rollouts and filters for pivots: informative intermediate turns where sampled actions exhibit high variance in outcomes; second, it utilizes rewards for functional-equivalent actions rather than demanding strict string matching with the SFT data demonstration. We theoretically show that these mechanisms incentivize strong learning signals with high natural gradient norm, while maximally preserving policy probability ordering on actions unrelated to training tasks. In comparison to standard SFT on identical data, we demonstrate that PivotRL achieves +4.17% higher in-domain accuracy on average across four agentic domains, and +10.04% higher OOD accuracy in non-agentic tasks. Notably, on agentic coding tasks, PivotRL achieves competitive accuracy with E2E RL with 4x fewer rollout turns. PivotRL is adopted by NVIDIA's Nemotron-3-Super-120B-A12B, acting as the workhorse in production-scale agentic post-training.
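Abstract (reference translation): Post-training for long-horizon agentic tasks involves a tension between compute efficiency and generalization. Supervised fine-tuning (SFT) is compute efficient but often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities but incurs high compute costs because of many turns of on-policy rollout. This paper introduces PivotRL, a novel framework that operates on existing SFT trajectories and combines the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms. First, it executes local, on-policy rollouts and filters for pivots: informative intermediate turns where sampled actions exhibit high variance in outcomes. Second, it rewards functionally equivalent actions rather than demanding strict string matching with the SFT demonstration data. The authors show theoretically that these mechanisms incentivize strong learning signals with high natural gradient norm while maximally preserving the policy's probability ordering over actions unrelated to the training tasks. Compared with standard SFT on identical data, PivotRL achieves +4.17% higher in-domain accuracy on average across four agentic domains and +10.04% higher OOD accuracy on non-agentic tasks. Notably, on agentic coding tasks, PivotRL achieves accuracy competitive with E2E RL using 4x fewer rollout turns. PivotRL is adopted by NVIDIA's Nemotron-3-Super-120B-A12B, where it acts as the workhorse in production-scale agentic post-training.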
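
The pivot-filtering idea in the abstract (keep intermediate turns where sampled actions show high variance in outcomes) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_pivots`, the `min_variance` threshold, the use of population variance, and the example reward values are all assumptions made here for illustration only.

```python
from statistics import pvariance


def select_pivots(turn_rewards, min_variance=0.1):
    """Return indices of turns whose sampled-action rewards vary enough.

    turn_rewards: one list per turn, holding the rewards of the actions
    sampled at that turn (hypothetical input format).
    min_variance: hypothetical threshold; the paper's actual filtering
    criterion is not given in the abstract.
    """
    pivots = []
    for turn_index, rewards in enumerate(turn_rewards):
        if len(rewards) < 2:
            continue  # variance is uninformative with fewer than two samples
        if pvariance(rewards) >= min_variance:
            pivots.append(turn_index)
    return pivots


# Made-up rewards: 4 local rollouts sampled at each of 3 turns.
example = [
    [1.0, 1.0, 1.0, 1.0],  # identical outcomes: zero variance
    [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: variance 0.25, kept as a pivot
    [0.0, 0.0, 0.0, 0.0],  # identical outcomes: zero variance
]
print(select_pivots(example))  # prints [1]
```

Under these assumptions, only the turn with mixed outcomes is retained, which matches the abstract's description of pivots as turns where the sampled actions disagree in outcome.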

Related papers