Fugu-MT Paper Translation (Abstract): CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

Paper summary: CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

arxiv url: http://arxiv.org/abs/2605.14274v1
Date: Thu, 14 May 2026 02:18:58 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.581766
Title: CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
Title (reference translation): CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
Authors: Zhenyang Ni, Yijiang Li, Ruochen Jiao, Simon Sinong Zhan, Sipeng Chen, Zhenfei Yin, Minshuo Chen, Philip Torr, Zhaoran Wang, Qi Zhu
Abstract summary: This paper proposes a compositional constraint-based reward model for post-training embodied video generation models. The proposed CreFlow is a novel online RL framework with two key designs.
Reference score (attention level, computed independently by this site): 56.969946199335716
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video generation models trained on heterogeneous data with likelihood-surrogate objectives can produce visually plausible rollouts that violate physical constraints in embodied manipulation. Although reinforcement-learning post-training offers a natural route to adapting VGMs, existing video-RL rewards often reduce each rollout to a low-level visual metric, whereas manipulation video evaluation requires logic-based verification of whether the rollout satisfies a compositional task specification. To fill this gap, we introduce a compositional constraint-based reward model for post-training embodied video generation models, which automatically formulates task requirements as a composition of Linear Temporal Logic constraints, providing faithful rewards and localized error information in generated videos. To achieve effective improvement in high-dimensional video generation using these reward signals, we further propose CreFlow, a novel online RL framework with two key designs: i) a credit-aware NFT loss that confines the RL update to reward-relevant regions, preventing perturbations to unrelated regions during post-training; and ii) a corrective reflow loss that leverages within-group positive samples as an explicit estimate of the correction direction, stabilizing and accelerating training. Experiments show that CreFlow yields reward judgments better aligned with human and simulator success labels than existing methods and improves downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.
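Abstract (reference translation): Video generation models (VGMs) trained on heterogeneous data with likelihood-surrogate objectives can generate visually plausible rollouts that violate physical constraints in embodied manipulation. Reinforcement-learning post-training offers a natural route to adapting VGMs, but existing video-RL rewards often reduce each rollout to a low-level visual metric, whereas evaluating manipulation videos requires logic-based verification of whether the rollout satisfies a compositional task specification. To fill this gap, we introduce a compositional constraint-based reward model for post-training embodied video generation models; it automatically formulates task requirements as a composition of Linear Temporal Logic constraints, providing faithful rewards and localized error information for generated videos. To improve high-dimensional video generation effectively with these reward signals, we further propose CreFlow, a novel online RL framework with two key designs: i) a credit-aware NFT loss that confines the RL update to reward-relevant regions, preventing perturbations to unrelated regions during post-training; and ii) a corrective reflow loss that uses within-group positive samples as an explicit estimate of the correction direction, stabilizing and accelerating training. Experiments show that CreFlow yields reward judgments better aligned with human and simulator success labels than existing methods, and that it improves downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.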
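The abstract describes rewards built from a composition of Linear Temporal Logic constraints that also localize errors in the video. As a rough illustration of that idea only, the sketch below evaluates three standard temporal operators (always, eventually, until) over a finite trace of per-frame boolean predicates and returns a sparse 0/1 reward together with the frame index at which each failed constraint is detected. Everything here is an assumption made for illustration: the predicate names (`grasped`, `collision`, `placed`), the helper names, and the finite-trace semantics are hypothetical and are not taken from the paper, whose reward model derives its constraints automatically. The credit-aware NFT loss and the corrective reflow loss are not covered.

```python
from typing import Callable, Dict, List, Optional, Tuple

Frame = Dict[str, bool]  # hypothetical per-frame predicates (e.g. from a detector)
Check = Callable[[List[Frame]], Tuple[bool, Optional[int]]]


def always(pred: Callable[[Frame], bool]) -> Check:
    """G pred: fails at the first frame where pred is false."""
    def check(trace: List[Frame]) -> Tuple[bool, Optional[int]]:
        for t, frame in enumerate(trace):
            if not pred(frame):
                return False, t
        return True, None
    return check


def eventually(pred: Callable[[Frame], bool]) -> Check:
    """F pred: fails (reported at the last frame) if pred never holds."""
    def check(trace: List[Frame]) -> Tuple[bool, Optional[int]]:
        if any(pred(frame) for frame in trace):
            return True, None
        return False, len(trace) - 1
    return check


def until(hold: Callable[[Frame], bool], goal: Callable[[Frame], bool]) -> Check:
    """hold U goal: hold must be true at every frame before goal first holds."""
    def check(trace: List[Frame]) -> Tuple[bool, Optional[int]]:
        for t, frame in enumerate(trace):
            if goal(frame):
                return True, None
            if not hold(frame):
                return False, t
        return False, len(trace) - 1  # goal never reached
    return check


def sparse_reward(trace: List[Frame], constraints: Dict[str, Check]):
    """Reward is 1.0 only if every constraint holds; errors map name -> frame index."""
    errors = {}
    for name, check in constraints.items():
        ok, t = check(trace)
        if not ok:
            errors[name] = t
    return (1.0 if not errors else 0.0), errors


# Hypothetical four-frame rollout with a collision at frame 2.
trace = [
    {"grasped": False, "collision": False, "placed": False},
    {"grasped": True, "collision": False, "placed": False},
    {"grasped": True, "collision": True, "placed": False},
    {"grasped": True, "collision": False, "placed": True},
]
constraints = {
    "no_collision": always(lambda f: not f["collision"]),
    "eventually_placed": eventually(lambda f: f["placed"]),
    "not_placed_until_grasped": until(lambda f: not f["placed"], lambda f: f["grasped"]),
}
r, errs = sparse_reward(trace, constraints)
print(r, errs)  # 0.0 {'no_collision': 2}
```

In this toy rollout only the collision constraint fails, so the reward is zero and the error is localized to frame 2. A per-frame index of this kind is one plausible form that "localized error information" could take, though the paper may define it differently.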

Related papers