Fugu-MT Paper Translation (Abstract): QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

Paper abstract: QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

arxiv url: http://arxiv.org/abs/2606.14801v1
Date: Thu, 11 Jun 2026 18:22:03 GMT
Status: Translation complete
System update date: 2026-06-16 16:21:32.240407
Title: QPILOTS: Efficient Test-Time Q-Steering for Flow Policies
Title (reference translation): QPILOTS: Efficient Test-Time Q-Steering for Flow Policies
Authors: Yifan Ruan, Chenyang Cao, Andreas Burger, Ali Pesaranghader, Kaveh Kamali, Jaehong Kim, Nandita Vijaykumar, Alan Aspuru-Guzik, Igor Gilitschenski, Nicholas Rhinehart
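Abstract summary: QPILOTS is a method that steers the denoising process at inference time without modifying the original policy. On a standard offline-to-online RL benchmark, QPILOTS achieves the best aggregate performance, reaching an average success rate of 90% across 50 tasks.
Reference score (independently computed attention measure): 20.217020870532686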
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) remains difficult. Effective policy extraction requires exploiting the critic's action gradient, yet directly backpropagating this signal through a multi-step denoising process can be numerically unstable. Existing methods work around this either by discarding gradient information, distilling the policy into a simpler one-step actor, or repeatedly fine-tuning the denoising policy as the critic improves. We propose QPILOTS, a method that leaves the original policy unmodified and steers the denoising process at inference time. At each denoising step, instead of evaluating the critic on the noisy intermediate action where critic predictions are unreliable, we first project that intermediate state to an estimate of the final clean action and compute the critic gradient there. We introduce two variants: QPILOTS-U uses a fast single-point approximation, while QPILOTS-M draws differentiable posterior samples via a learned auxiliary network. On a standard offline-to-online RL benchmark, QPILOTS achieves the best aggregate performance, reaching an average success rate of 90% across 50 tasks. We also apply QPILOTS to steer a large, frozen, pretrained Vision-Language Action (VLA) foundation model, outperforming or matching prior inference-time approaches across six manipulation tasks in simulation.
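Abstract (reference translation): Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) is still difficult. Effective policy extraction needs to use the critic's action gradient, yet backpropagating this signal directly through a multi-step denoising process can be numerically unstable. Existing methods either discard the gradient information, distill the policy into a simpler one-step actor, or repeatedly fine-tune the denoising policy as the critic improves. The proposed method, QPILOTS, leaves the original policy unchanged and steers the denoising process at inference time. At each denoising step, rather than evaluating the critic on the noisy intermediate action, where critic predictions are unreliable, it first projects the intermediate state to an estimate of the final clean action and computes the critic gradient there. QPILOTS-U uses a fast single-point approximation, while QPILOTS-M draws differentiable posterior samples through a learned auxiliary network. On a standard offline-to-online RL benchmark, QPILOTS achieves the best aggregate performance, reaching an average success rate of 90% across 50 tasks. QPILOTS is also used to steer a large, frozen, pretrained Vision-Language Action (VLA) foundation model, outperforming or matching prior inference-time approaches across six manipulation tasks in simulation.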
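The steering step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the closed-form Gaussian flow, the quadratic critic, the additive velocity update, and all names (`velocity`, `q_grad`, `sample_action`, `steer_scale`) are assumptions made for illustration. It only shows the single-point idea attributed to QPILOTS-U: project the noisy intermediate action to a clean-action estimate, take the critic gradient there, and nudge the denoising update with it while the base policy stays frozen.

```python
import numpy as np

# Toy frozen base policy (assumption): a flow that carries N(0, I) noise at t=0
# to N(MU, S^2 I) actions at t=1 along paths x_t = (1 - t) * x_0 + t * x_1.
MU = np.array([0.0, 0.0])      # mean action of the base policy
S = 0.5                        # standard deviation of the base policy's actions
A_STAR = np.array([1.0, 1.0])  # action preferred by the toy critic


def velocity(x, t):
    """Closed-form marginal velocity E[x_1 - x_0 | x_t = x] of the toy flow."""
    var_t = (1.0 - t) ** 2 + (t * S) ** 2
    coef = (t * S**2 - (1.0 - t)) / var_t
    return MU + coef * (x - t * MU)


def q_value(a):
    """Toy critic Q(a) = -||a - A_STAR||^2 (state dependence omitted)."""
    return -np.sum((a - A_STAR) ** 2, axis=-1)


def q_grad(a):
    """Analytic action gradient of the toy critic."""
    return -2.0 * (a - A_STAR)


def sample_action(x0, steer_scale=0.0, num_steps=20):
    """Euler-integrate the flow from noise x0; steer_scale=0 is the base policy."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        # Single-point estimate of the final clean action from the noisy state.
        a_hat = x + (1.0 - t) * v
        # Critic gradient is evaluated at the clean estimate, not at x, and
        # added to the velocity (assumed update rule; projection Jacobian ignored).
        x = x + dt * (v + steer_scale * q_grad(a_hat))
    return x


rng = np.random.default_rng(0)
noise = rng.standard_normal((2000, 2))
print("mean Q, base policy:", q_value(sample_action(noise)).mean())
print("mean Q, steered:    ", q_value(sample_action(noise, steer_scale=0.5)).mean())
```

Because the velocity field is never updated, setting `steer_scale=0.0` recovers the base policy exactly; the steering strength is an inference-time knob in this sketch.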

Related papers