Fugu-MT Paper Translation (Abstract): RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

Paper abstract: RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

arxiv url: http://arxiv.org/abs/2604.19092v1
Date: Tue, 21 Apr 2026 05:09:56 GMT
Status: Translation complete
System update date: 2026-04-22 22:41:49.631504
Title: RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
Title (reference translation): RoboWM-Bench: A World Model Evaluation Benchmark for Robotic Manipulation
Authors: Feng Jiang, Yang Chen, Kyle Xu, Yuchen Liu, Haifeng Wang, Zhenhao Shen, Jasper Lu, Shengze Huang, Yuanfei Wang, Chen Xie, Ruihai Wu
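Abstract summary: RoboWM-Bench is a manipulation-centric benchmark for evaluating video world models. We evaluate state-of-the-art video world models and find that reliably generating physically executable behaviors remains an open challenge.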
Reference score (independently computed attention score): 23.57524297963567
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of leveraging imagined videos for robot learning. However, visual realism does not imply physical plausibility, and behaviors inferred from generated videos may violate dynamics and fail when executed by embodied agents. Existing benchmarks begin to incorporate notions of physical plausibility, but they largely remain perception- or diagnostic-oriented and do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete the intended task. To address this gap, we introduce RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. RoboWM-Bench converts generated behaviors from both human-hand and robotic manipulation videos into embodied action sequences and validates them through robotic execution. The benchmark spans diverse manipulation scenarios and establishes a unified protocol for consistent and reproducible evaluation. Using RoboWM-Bench, we evaluate state-of-the-art video world models and find that reliably generating physically executable behaviors remains an open challenge. Common failure modes include errors in spatial reasoning, unstable contact prediction, and non-physical deformations. While finetuning on manipulation data yields improvements, physical inconsistencies still persist, suggesting opportunities for more physically grounded video generation for robots.
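Abstract (reference translation): Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the possibility of using imagined videos for robot learning. However, visual realism does not imply physical plausibility, and behaviors inferred from generated videos may violate dynamics and fail when executed by embodied agents. Existing benchmarks have begun to incorporate notions of physical plausibility, but they remain largely perception- or diagnostic-oriented and do not systematically evaluate whether predicted behaviors can be converted into executable actions that complete the intended task. To address this gap, we introduce RoboWM-Bench, a manipulation-centric benchmark for evaluating video world models. RoboWM-Bench converts behaviors generated from both human-hand and robotic manipulation videos into embodied action sequences and validates them through robotic execution. The benchmark spans diverse manipulation scenarios and establishes a unified protocol for consistent and reproducible evaluation. Using RoboWM-Bench, we evaluate state-of-the-art video world models and find that reliably generating physically executable behaviors remains an open challenge. Common failure modes include errors in spatial reasoning, unstable contact prediction, and non-physical deformations. Fine-tuning on manipulation data yields improvements, but physical inconsistencies persist, suggesting opportunities for more physically grounded video generation for robots.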
