Fugu-MT paper translation (abstract): Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

Paper overview: Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

arxiv url: http://arxiv.org/abs/2606.08015v1
Date: Sat, 06 Jun 2026 07:10:25 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:05.665104
Title: Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies
Title (reference translation): Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies
Authors: Ziqian Wang, Jiayu Sun, Xingjian Mao, Minqian Wang, Yao Mu
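Abstract summary: This paper proposes Q-Guided Value-Gradient Matching (Q-VGM), an off-policy reinforcement learning (RL) method. Q-VGM sidesteps the instability of backpropagating through the denoising chain by leveraging VGG-Flow, a value-gradient view of flow alignment in generative modeling. Q-VGM raises the average success rate from 75.0% to 92.5% on LIBERO, from 76.4% to 87.2% on RoboTwin 2.0, and from 40.0% to 67.5% on two real-robot tabletop tasks.
Reference score (independently computed attention score): 14.519898493996891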
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose Q-Guided Value-Gradient Matching (Q-VGM), an off-policy reinforcement learning (RL) method that tackles a long-standing challenge in fine-tuning flow-matching vision-language-action (VLA) policies: efficiently improving an expressive flow-matching action expert with respect to a learned Q-function. Effective improvement must exploit the first-order (gradient) information of the critic, but this is difficult for flow policies, because directly back-propagating the value through their multi-step denoising process is numerically unstable at VLA scale, while the tractable action likelihoods required by policy-gradient methods are unavailable under iterative denoising. Existing value-based methods either backpropagate through the full denoising chain, use the critic only at test time without updating the policy, or distill critic-improved actions as terminal labels without supervising the velocity field. Q-VGM sidesteps these issues by leveraging VGG-Flow, a value-gradient view of flow alignment in generative modeling that transforms value gradient into a denoising-time value-gradient field rather than an unstable end-to-end objective. This requires no action likelihoods and no backpropagation through the denoising chain, and operates on a fixed replay buffer. The critic is an action-sensitive Cal-QL ensemble over compact RLT features with per-layer action injection. Q-VGM enables a practical few-shot initialization then learn-from-experience paradigm: starting from a few-shot-SFT pi0.5 VLA, the method leverages self-generated rollout data to substantially improve task performance without additional expert supervision. On LIBERO, Q-VGM raises the average success rate from 75.0% to 92.5%; on RoboTwin 2.0, from 76.4% to 87.2%; and on two real-robot tabletop tasks, from 40.0% to 67.5%, outperforming all same-backbone, same-critic baselines across all three settings.
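Abstract (reference translation): We propose Q-Guided Value-Gradient Matching (Q-VGM), an off-policy reinforcement learning (RL) method that addresses a long-standing challenge in fine-tuning flow-matching vision-language-action (VLA) policies: efficiently improving an expressive flow-matching action expert with respect to a learned Q-function. Effective improvement must exploit the critic's first-order (gradient) information, but this is difficult for flow policies: backpropagating the value directly through the multi-step denoising process is numerically unstable at VLA scale, and the tractable action likelihoods required by policy-gradient methods are unavailable under iterative denoising. Existing value-based methods either backpropagate through the full denoising chain, use the critic only at test time without updating the policy, or distill critic-improved actions as terminal labels without supervising the velocity field. Q-VGM resolves these issues by leveraging VGG-Flow, a value-gradient view of flow alignment in generative modeling that turns the value gradient into a denoising-time value-gradient field rather than an unstable end-to-end objective. It requires neither action likelihoods nor backpropagation through the denoising chain, and it operates on a fixed replay buffer. The critic is an action-sensitive Cal-QL ensemble over compact RLT features with per-layer action injection. Starting from a few-shot-SFT pi0.5 VLA, Q-VGM uses self-generated rollout data to substantially improve task performance without additional expert supervision. On LIBERO, Q-VGM raises the average success rate from 75.0% to 92.5%; on RoboTwin 2.0, from 76.4% to 87.2%; and on two real-robot tabletop tasks, from 40.0% to 67.5%, outperforming all same-backbone, same-critic baselines across all three settings.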
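
The abstract says the critic's gradient is used as a denoising-time field, with no action likelihoods and no backpropagation through the denoising chain. The toy sketch below may help picture what a regression target of that general kind could look like. It is not the Q-VGM or VGG-Flow objective, which the abstract does not specify. The quadratic critic, the linear interpolation path, the additive form of the correction, and the names `grad_q_toy`, `guided_velocity_target`, `velocity_loss`, `a_star`, and `lam` are all assumptions made for illustration.

```python
import numpy as np


def grad_q_toy(a, a_star):
    """Gradient w.r.t. a of the toy critic Q(a) = -0.5 * ||a - a_star||^2.

    Hypothetical stand-in for the gradient of a learned Q-function.
    """
    return a_star - a


def guided_velocity_target(x0, a, t, a_star, lam):
    """Build one batch of (x_t, target) pairs for velocity regression.

    x0:     noise samples, shape (B, D)
    a:      actions drawn from a fixed replay buffer, shape (B, D)
    t:      denoising times in [0, 1], shape (B, 1)
    a_star: maximiser of the toy critic, broadcastable to (B, D)
    lam:    hypothetical guidance weight; lam = 0 gives plain flow matching
    """
    x_t = (1.0 - t) * x0 + t * a      # point on the linear interpolation path
    fm_target = a - x0                # standard flow-matching velocity target
    # The critic gradient is evaluated at the buffer action and treated as a
    # constant, so nothing is differentiated through a multi-step sampler.
    target = fm_target + lam * grad_q_toy(a, a_star)
    return x_t, target


def velocity_loss(v_pred, target):
    """Mean over the batch of the squared error summed over action dims."""
    return float(np.mean(np.sum((v_pred - target) ** 2, axis=-1)))
```

In this sketch the velocity model would be trained by calling `velocity_loss` on its prediction at `(x_t, t)`; only a single critic-gradient evaluation per sample is needed, which is the property the abstract emphasises.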
