Fugu-MT Paper Translation (Abstract): Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

Paper overview: Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

arxiv url: http://arxiv.org/abs/2603.00110v1
Date: Wed, 18 Feb 2026 14:58:18 GMT
Status: Translation complete
System update date: 2026-03-30 05:03:20.268771
Title: Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation
Title (reference translation): Learning Physics from Pretrained Video Models: Continuous and Sequential World Interaction Models for Robotic Manipulation
Authors: Zijian Song, Qichang Li, Sihan Qin, Yuhao Chen, Tianshui Chen, Liang Lin, Guangrun Wang
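Abstract summary: We introduce PhysGen, a scalable continuous and sequential world interaction framework for solving robotic manipulation tasks. By treating the pretrained video model as a proxy for a physics simulator, PhysGen models the dynamic interplay between the external environment and robot actions. We propose a multimodal continuous representation that unifies video and action into shared physical tokens, bridging the gap between discrete video generation and continuous robotic control.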
Reference score (independently computed attention score): 63.04810454548667
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The scarcity of large-scale robotic data has motivated the repurposing of foundation models from other modalities for policy learning. In this work, we introduce PhysGen (Learning Physics from Pretrained Video Generation Models), a scalable continuous and sequential world interaction framework that leverages autoregressive video generation to solve robotic manipulation tasks. By treating the pretrained video model as a proxy for a physics simulator, PhysGen models the dynamic interplay between the external environment and robot actions. We introduce a multimodal continuous representation that unifies video and action into shared physical tokens, bridging the gap between discrete video generation and continuous robotic control. This approach enables the seamless transfer of implicit physical knowledge, such as object permanence and dynamics, from video pretraining to downstream manipulation. To ensure efficient convergence, we incorporate causal masking, inverse kinematics, Lookahead Multi-Token Prediction (L-MTP), and key-value (KV) caching. Experimental results on the Libero and ManiSkill benchmarks demonstrate that PhysGen consistently outperforms robust baselines, surpassing OpenVLA and WorldVLA by margins of 13.8% and 8.8%, respectively. Notably, in real-world scenarios, PhysGen matches the performance of large-scale action-pretrained models like $π_0$ without requiring prior action-specific pretraining, demonstrating superior capability in physically complex tasks such as grasping transparent objects. These findings validate the potential of extracting physical intuition from pretrained video generators to facilitate generalizable robotic manipulation.
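Abstract (reference translation): The scarcity of large-scale robotic data has motivated the reuse of foundation models from other modalities for policy learning. We introduce PhysGen, a scalable continuous and sequential world interaction framework that uses autoregressive video generation to solve robotic manipulation tasks. By treating the pretrained video model as a proxy for a physics simulator, PhysGen models the dynamic interplay between the external environment and robot actions. We propose a multimodal continuous representation that unifies video and action into shared physical tokens, bridging the gap between discrete video generation and continuous robotic control. This approach makes it possible to transfer implicit physical knowledge, such as object permanence and dynamics, seamlessly from video pretraining to downstream manipulation. For efficient convergence, we incorporate causal masking, inverse kinematics, Lookahead Multi-Token Prediction (L-MTP), and key-value (KV) caching. Experiments on the Libero and ManiSkill benchmarks show that PhysGen consistently outperforms robust baselines, surpassing OpenVLA and WorldVLA by margins of 13.8% and 8.8%, respectively. Notably, in real-world scenarios, PhysGen matches the performance of large-scale action-pretrained models such as $π_0$ without requiring prior action-specific pretraining, and shows superior capability on physically complex tasks such as grasping transparent objects. These results validate the potential of extracting physical intuition from pretrained video generators to facilitate generalizable robotic manipulation.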

Related papers