Fugu-MT paper translation (abstract): $τ_0$-WM: A Unified Video-Action World Model for Robotic Manipulation

Paper overview: $τ_0$-WM: A Unified Video-Action World Model for Robotic Manipulation

arxiv url: http://arxiv.org/abs/2606.01027v1
Date: Sun, 31 May 2026 05:35:36 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:50:16.085185
Title: $τ_0$-WM: A Unified Video-Action World Model for Robotic Manipulation
Authors: Pengfei Zhou, Shengcong Chen, Di Chen, Jiaxu Wang, Rongjun Jin, Bingwen Zhu, Yike Pan, Songen Gu, Kuanning Wang, Shufeng Nan, Xingyu Qiu, Chenhao Qiu, Pu Yang, Yunuo Cai, Jianxiong Gao, Yifan Li, Yanwei Fu, Xiangyu Yue, Zhi Chen, Jianlan Luo
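Abstract summary: We present a unified video-action world model that integrates policy learning, video prediction, and action evaluation. The model is trained on approximately 27,300 hours of data, including real-robot teleoperation.
Reference score (attention score computed independently by this site): 45.040666672458634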
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Robotic manipulation requires models that generate executable actions while anticipating and evaluating their future consequences before physical execution. We present $τ_0$-World Model ($τ_0$-WM), a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-predictive framework. Built on a shared video diffusion backbone, $τ_0$-WM provides two complementary interfaces. First, a video action model jointly predicts future visual latents and continuous action chunks from multi-view observations, language instructions, and robot state. Second, an action-conditioned video simulator rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. The model is trained on approximately $27{,}300$ hours of real-robot teleoperation, UMI-style interaction, egocentric human videos, and rollout or failure trajectories using modality-specific supervision masks. At inference time, $τ_0$-WM uses test-time computation to sample action candidates, rank them with re-denoising consistency, and invoke simulator-based rectification for low-quality candidates. On challenging long-horizon and fine-grained robotic manipulation tasks, $τ_0$-WM shows superior performance over other relevant baselines.
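The inference procedure described in the abstract (sample action candidates, rank them by re-denoising consistency, and rectify low-quality candidates with the simulator) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `sample`, `denoise`, and `rectify`, the threshold, and in particular the definition of the consistency score (perturb a candidate, denoise it again, and measure how far the result drifts from the original) are hypothetical and are not taken from the paper.

```python
import random
from typing import Callable, List

ActionChunk = List[float]


def l2_distance(a: ActionChunk, b: ActionChunk) -> float:
    """Euclidean distance between two equally sized action chunks."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def redenoise_consistency(
    chunk: ActionChunk,
    denoise: Callable[[ActionChunk], ActionChunk],
    noise_scale: float,
    rng: random.Random,
) -> float:
    """Hypothetical score: add noise, denoise again, and return the negative
    distance to the original chunk (higher means more consistent)."""
    noised = [x + rng.gauss(0.0, noise_scale) for x in chunk]
    return -l2_distance(chunk, denoise(noised))


def select_action(
    sample: Callable[[random.Random], ActionChunk],
    denoise: Callable[[ActionChunk], ActionChunk],
    rectify: Callable[[ActionChunk], ActionChunk],
    num_candidates: int = 8,
    threshold: float = -0.5,
    noise_scale: float = 0.1,
    seed: int = 0,
) -> ActionChunk:
    """Sample candidates, rank them, and rectify the best one if it scores low."""
    rng = random.Random(seed)
    # 1. Sample several candidate action chunks from the (stand-in) policy.
    candidates = [sample(rng) for _ in range(num_candidates)]
    # 2. Score each candidate with the assumed consistency measure.
    scored = [
        (redenoise_consistency(c, denoise, noise_scale, rng), c)
        for c in candidates
    ]
    best_score, best = max(scored, key=lambda pair: pair[0])
    # 3. Fall back to the (stand-in) simulator-based rectification step
    #    when even the best candidate is below the quality threshold.
    if best_score < threshold:
        return rectify(best)
    return best


if __name__ == "__main__":
    # Toy stand-ins: the "model" always denoises toward a fixed target chunk.
    target = [1.0, 2.0]
    toy_sample = lambda rng: [t + rng.gauss(0.0, 1.0) for t in target]
    toy_denoise = lambda noised: list(target)
    toy_rectify = lambda chunk: list(target)
    print(select_action(toy_sample, toy_denoise, toy_rectify))
```

In this toy setup the score reduces to the negative distance from the target, so the highest-ranked candidate is simply the sample closest to it; a real system would replace the three callables with the video action model and the action-conditioned simulator.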

Related papers