Fugu-MT Paper Translation (Abstract): MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

Paper overview: MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

arxiv url: http://arxiv.org/abs/2512.06628v1
Date: Sun, 07 Dec 2025 02:28:06 GMT
Status: Translation complete
Last updated in system: 2025-12-09 22:03:54.437267
Title: MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment
Title (reference translation): MIND-V: Hierarchical Video Generation for Robotic Manipulation Using RL
Authors: Ruicheng Zhang, Mingyang Zhang, Jun Zhou, Zhangrui Guo, Xiaofan Liu, Zunnan Xu, Zhizhou Zhong, Puxin Yan, Haocheng Luo, Xiu Li
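Abstract summary: We introduce MIND-V, a hierarchical framework for synthesizing logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning and pixel-level synthesis. MIND-V demonstrates state-of-the-art performance in long-horizon robotic manipulation video generation.
Reference score (independently computed attention score): 20.463231924099567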
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Embodied imitation learning is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video generation models for this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND-V, a hierarchical framework designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy to enhance long-horizon robustness. To align the generated videos with physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V-JEPA world model to enforce physical plausibility by aligning the predicted and actual dynamic evolutions in the feature space. MIND-V demonstrates state-of-the-art performance in long-horizon robotic manipulation video generation, establishing a scalable and controllable paradigm for embodied data synthesis.
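Abstract (reference translation): Embodied imitation learning is limited by a shortage of diverse, long-horizon robotic manipulation data. Existing video generation models in this domain can only synthesize short clips of simple actions and often depend on manually defined trajectories. To address this, we introduce MIND-V, a hierarchical framework for synthesizing physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning and pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that uses a pre-trained vision-language model for task planning, a Behavioral Semantic Bridge (BSB) that converts abstract instructions into domain-invariant representations, and a Motor Video Generator (MVG) for conditional video rendering. MIND-V adopts Staged Visual Future Rollouts, a test-time optimization strategy that improves long-horizon robustness. To align the generated videos with physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a new Physical Foresight Coherence (PFC) reward. PFC uses the V-JEPA world model to enforce physical plausibility by aligning the predicted and actual dynamic evolutions in the feature space. MIND-V demonstrates state-of-the-art performance in long-horizon robotic manipulation video generation and establishes a scalable, controllable paradigm for embodied data synthesis.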


Related papers