Fugu-MT Paper Translation (Abstract): ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

Paper summary: ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

arxiv url: http://arxiv.org/abs/2603.23376v2
Date: Fri, 27 Mar 2026 09:50:16 GMT
Status: Translation complete
Last updated in system: 2026-03-30 21:49:48.137223
Title: ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment
Title (reference translation): ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation Using Physics Alignment
Authors: Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu
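Abstract summary: ABot-PhysWorld is a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action-controllable videos. It uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. On PBench and EZSbench, it surpasses Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency.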
Reference score (attention score computed by this site): 31.000965640377128
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations - such as object penetration and anti-gravity motion - due to training on generic visual data and likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotation, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent embodied zero-shot benchmark combining real and synthetic unseen robot-task-scene combinations. It employs a decoupled protocol to separately assess physical realism and action alignment. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.
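Abstract (reference translation): Video-based world models offer a powerful paradigm for embodied simulation and planning, but state-of-the-art models often produce physically implausible manipulations, such as object penetration and anti-gravity motion, because they are trained on generic visual data with likelihood-based objectives that ignore physical laws. ABot-PhysWorld is a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotation, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. EZSbench is the first training-independent embodied zero-shot benchmark, combining real and synthetic unseen robot-task-scene combinations. It uses a decoupled protocol to assess physical realism and action alignment separately. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.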
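
Note on "DPO-based post-training": the abstract does not give the training objective, so the following is only a minimal sketch of the generic Direct Preference Optimization loss for a single preference pair, not the authors' method. The function name, argument names, and the default `beta` are assumptions made for illustration. How the paper obtains these quantities for a diffusion model, and how its decoupled discriminators enter the objective, is not stated in the abstract.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for one (preferred, rejected) pair.

    Illustrative only: all names and the default beta are assumptions,
    and this is the standard formulation, not the paper's variant.
    """
    # How much more the trained model favors the preferred sample over the
    # rejected one, relative to the frozen reference model.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    z = beta * margin
    # -log(sigmoid(z)) written in a numerically stable softplus form.
    return math.log1p(math.exp(-abs(z))) + max(-z, 0.0)


# A pair where the model already favors the preferred (e.g. physically
# plausible) clip gives a lower loss than a pair where it favors the other.
good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
print(good < bad)
```

In this form, a pair on which the model and the reference agree gives a loss of log 2, and the loss falls as the model shifts probability toward the preferred sample.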
