Fugu-MT Paper Translation (Abstract): Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking

Paper overview: Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking

arxiv url: http://arxiv.org/abs/2605.16154v1
Date: Fri, 15 May 2026 16:33:59 GMT
Status: Translation complete
Last updated in system: 2026-05-18 17:44:16.352987
Title: Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
Title (reference translation): Learning Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
Authors: Vaidehi Bagaria, Nikshep Grampurohit, Pulkit Verma
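Abstract summary: This paper presents Probabilistic Chunk Masking (PCM), a drop-in modification to GRPO that allocates gradient computation to a small, probabilistically selected subset of chunks per trajectory. On three LIBERO benchmarks, PCM matches the final success rate of standard GRPO while achieving a 2.38 times wall-clock speedup, 4.8 times faster gradient updates, and 60% lower peak activation memory.
Reference score (independently computed attention score): 5.238545250784642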
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) allows vision-language-action (VLA) policies to generalize beyond their training distribution by optimizing directly for task success, but post-training is computationally expensive. A natural response has been to speed up rollout collection through faster simulators and world models. In GRPO-based VLA RL, we find that the dominant cost lies elsewhere: gradient computation accounts for approximately 78% of wall-clock time per step in our runs, while rollout collection accounts for only 21%. Gradient cost dominates because much of this computation is spent on phases that contribute little to learning. GRPO's learning signal is driven by advantage variance: only phases where successful and failed rollouts diverge produce learning signal. However, GRPO assigns the same advantage to every chunk in a rollout. As a result, actor-update compute is spent uniformly across the trajectory, including phases the policy already handles after pre-training and supervised fine-tuning. This paper presents Probabilistic Chunk Masking (PCM), a drop-in modification to GRPO that allocates gradient computation to a small, probabilistically selected subset of chunks per trajectory. PCM scores semantic phases using success-failure action variance, a rollout-derived proxy for per-phase gradient variance, and samples a fixed chunk budget with online-updated phase-level keep probabilities. We formalize per-phase gradient variance as the quantity that determines where gradient computation is useful and show that success-failure action variance provides a measurable proxy for it. PCM requires no reward model or learned critic. On three LIBERO benchmarks, PCM matches the final success rate of standard GRPO while achieving a 2.38 times wall-clock speedup, 4.8 times faster gradient updates, and 60% lower peak activation memory, while backpropagating through fewer than 20% of trajectory chunks.
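Abstract (reference translation): Reinforcement learning (RL) lets vision-language-action (VLA) policies generalize beyond their training distribution by optimizing directly for task success, but post-training is computationally expensive. A natural response has been to accelerate rollout collection with faster simulators and world models. In GRPO-based VLA RL, gradient computation accounts for about 78% of wall-clock time per step, while rollout collection accounts for only 21%. Gradient cost dominates because much of this computation is spent on phases that contribute little to learning. GRPO's learning signal is driven by advantage variance: only phases where successful and failed rollouts diverge produce learning signal. However, GRPO assigns the same advantage to every chunk in a rollout. As a result, actor-update compute is spent uniformly across the trajectory, including phases the policy already handles after pre-training and supervised fine-tuning. This paper proposes Probabilistic Chunk Masking (PCM), a drop-in modification to GRPO that allocates gradient computation to a small, probabilistically selected subset of chunks per trajectory. PCM scores semantic phases using success-failure action variance, a rollout-derived proxy for per-phase gradient variance, and samples a fixed chunk budget with online-updated phase-level keep probabilities. The authors formalize per-phase gradient variance as the quantity that determines where gradient computation is useful and show that success-failure action variance provides a measurable proxy for it. PCM requires no reward model or learned critic. On three LIBERO benchmarks, PCM matches the final success rate of standard GRPO while achieving a 2.38 times wall-clock speedup, 4.8 times faster gradient updates, and 60% lower peak activation memory, while backpropagating through fewer than 20% of trajectory chunks.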
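
The selection step described in the abstract (score phases by success-failure action variance, keep online-updated phase-level probabilities, sample a fixed chunk budget per trajectory) can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration: the function names, the scoring rule (squared distance between the mean actions of successful and failed rollouts within a phase), the exponential-moving-average update, the probability floor, and the weighted sampling scheme. It is not the authors' implementation.

```python
import random
from collections import defaultdict


def phase_scores(actions, phases, successes):
    """Assumed proxy: per phase, squared distance between the mean action
    of successful rollouts and the mean action of failed rollouts."""
    groups = defaultdict(lambda: {True: [], False: []})
    for acts, phs, ok in zip(actions, phases, successes):
        for a, p in zip(acts, phs):
            groups[p][bool(ok)].append(a)
    scores = {}
    for p, g in groups.items():
        if not g[True] or not g[False]:
            scores[p] = 0.0  # no success/failure contrast observed in this phase
            continue
        mean_s = [sum(col) / len(g[True]) for col in zip(*g[True])]
        mean_f = [sum(col) / len(g[False]) for col in zip(*g[False])]
        scores[p] = sum((s - f) ** 2 for s, f in zip(mean_s, mean_f))
    return scores


def update_keep_probs(keep, scores, momentum=0.9, floor=0.05):
    """Assumed online update: moving average of normalized scores, with a
    floor so that no phase is permanently excluded from sampling."""
    total = sum(scores.values())
    new = dict(keep)
    for p, s in scores.items():
        target = s / total if total > 0 else 1.0 / len(scores)
        prev = keep.get(p, target)
        new[p] = max(floor, momentum * prev + (1 - momentum) * target)
    return new


def sample_chunk_mask(phase_ids, keep, budget, rng):
    """Pick exactly `budget` chunks without replacement, weighted by the keep
    probability of each chunk's phase (Efraimidis-Spirakis random keys)."""
    budget = min(budget, len(phase_ids))
    keys = [(rng.random() ** (1.0 / keep[p]), t) for t, p in enumerate(phase_ids)]
    chosen = {t for _, t in sorted(keys, reverse=True)[:budget]}
    return [t in chosen for t in range(len(phase_ids))]


# Toy data: 4 rollouts of 6 chunks each; only phase 1 separates success from failure.
phases = [[0, 0, 1, 1, 2, 2]] * 4
good = [[0.0], [0.0], [1.0], [1.0], [0.0], [0.0]]
bad = [[0.0], [0.0], [-1.0], [-1.0], [0.0], [0.0]]
actions = [good, good, bad, bad]
successes = [True, True, False, False]

scores = phase_scores(actions, phases, successes)
keep = update_keep_probs({}, scores)
mask = sample_chunk_mask(phases[0], keep, budget=2, rng=random.Random(0))
print(scores, keep, mask)
```

In a training loop, chunks whose mask entry is `False` would be left out of the actor loss so that no backward pass is run through them. How the phase labels are obtained and how the masked loss is normalized are not specified in the abstract and are left open here.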

Related papers