Fugu-MT Paper Translation (Abstract): Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Paper abstract: Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

arxiv url: http://arxiv.org/abs/2606.11087v1
Date: Tue, 09 Jun 2026 16:45:57 GMT
Status: translation complete
Last updated in system: 2026-06-10 15:40:58.621221
Title: Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning
Title (reference translation): Test-time gradient guidance of flow policies in reinforcement learning
Authors: Zhiyuan Zhou, Andy Peng, Charles Xu, Qiyang Li, Tobias Springenberg, Kevin Frans, Sergey Levine
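Abstract summary: Expressive continuous control policies form the backbone of advances in scaling imitation learning for simulated and real robot control. The paper proposes QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks.
Reference score (independently computed attention score): 50.738952715864116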
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. It often requires specialized training objectives or backpropagating through denoising processes, which cause well-known issues with stability and affect scalability. In this paper we study the question of whether simple policy improvement schemes at test time alone, leaving stable supervised policy training intact, can be a competitive alternative which sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF works by pre-training both a reference flow policy (via a standard behavioral cloning objective) and a value function critic and, at test time, using the value gradient to guide the reference policy to generate higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it exhibits favorable scaling with model size by avoiding the instability of actor-critic training, offering a practical and effective alternative RL algorithm with expressive policies.
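Abstract (reference translation): Expressive continuous control policies such as diffusion and flow models form the backbone of recent progress in scaling imitation learning for simulated and real robot control. Although they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven harder. Doing so often requires specialized training objectives or backpropagation through the denoising process, which causes well-known stability problems and limits scalability. This paper examines whether simple policy improvement applied only at test time, with stable supervised policy training left intact, can be a competitive alternative that avoids these problems. To that end, it proposes QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF pre-trains both a reference flow policy (with a standard behavioral cloning objective) and a value function critic; at test time it uses the value gradient to guide the reference policy toward higher-value actions, with no additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and it is competitive with state-of-the-art training-time algorithms while being much cheaper to run. It also scales favorably with model size because it avoids the instability of actor-critic training, making it a practical and effective alternative RL algorithm for expressive policies.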
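
The abstract describes guiding a frozen reference flow policy with the gradient of a critic at test time, but it does not state the exact update rule. The following is a minimal sketch of one common form of gradient guidance, not the paper's algorithm: at each Euler step of the flow, a scaled value gradient is added to the reference velocity. Every name here (`reference_velocity`, `q_value`, `q_gradient`, `sample_action`, `guidance_scale`) and both toy functions are assumptions made for illustration, and state conditioning is omitted for brevity.

```python
# Toy, hypothetical stand-ins: none of these names or formulas come from the paper.
BEHAVIOR_ACTION = [0.0, 0.0]  # action the toy "reference policy" imitates
BEST_ACTION = [1.0, 1.0]      # action the toy critic prefers


def reference_velocity(action, t):
    """Velocity of a straight-line flow that carries any sample to BEHAVIOR_ACTION at t = 1."""
    return [(b - a) / (1.0 - t) for a, b in zip(action, BEHAVIOR_ACTION)]


def q_value(action):
    """Toy critic: negative squared distance to BEST_ACTION."""
    return -sum((a - g) ** 2 for a, g in zip(action, BEST_ACTION))


def q_gradient(action):
    """Analytic gradient of q_value with respect to the action."""
    return [-2.0 * (a - g) for a, g in zip(action, BEST_ACTION)]


def sample_action(noise=(1.0, -1.0), guidance_scale=0.5, num_steps=10):
    """Euler-integrate the reference flow, nudged by the critic gradient at each step."""
    action = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i / num_steps  # t stays below 1, so 1 - t never reaches zero
        v = reference_velocity(action, t)
        g = q_gradient(action)
        action = [a + dt * (vi + guidance_scale * gi)
                  for a, vi, gi in zip(action, v, g)]
    return action


unguided = sample_action(guidance_scale=0.0)  # plain reference policy sample
guided = sample_action(guidance_scale=0.5)    # same noise, value-guided
print(q_value(unguided), q_value(guided))
```

With `guidance_scale=0.0` the sampler reduces to the reference policy, and no policy parameters are updated in either case; only the sampling trajectory changes. In a real system both the velocity field and the critic would be learned networks, and the critic gradient would come from automatic differentiation rather than a closed form.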

Related papers