Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test
- URL: http://arxiv.org/abs/2601.04137v1
- Date: Wed, 07 Jan 2026 17:50:37 GMT
- Title: Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test
- Authors: Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, Weishi Mi, Qingpo Wuwu, Peidong Jia, Yulin Luo, Kevin Zhang, Zhiyuan Qin, Yong Dai, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang
- Abstract summary: We introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow,wo,val). Wow-wo-val examines five core abilities: perception, planning, prediction, generalization, and execution. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world.
- Score: 62.17144846428715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks like 3D prediction or interactive generation. However, before exploring these downstream tasks, video foundation models still leave two critical questions unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow,wo,val). Built upon 609 robot manipulation data samples, Wow-wo-val examines five core abilities: perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability, which achieves a high Pearson correlation between the overall score and human preference (>0.93) and establishes a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world. However, most models collapse to $\approx$ 0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between generated videos and the real world, highlighting the urgency and necessity of benchmarking world models in Embodied AI.
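The abstract reports that the benchmark's overall score aligns with human preference at a Pearson correlation above 0.93. As a reminder of what that statistic measures, here is a minimal sketch of computing a Pearson correlation between per-model benchmark scores and human preference ratings; the numbers below are purely illustrative, not values from the paper:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Covariance numerator and the two standard-deviation terms
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-model overall benchmark scores and mean human ratings
overall_scores = [68.0, 54.3, 41.7, 33.2, 17.3]
human_prefs    = [4.6, 3.9, 3.1, 2.4, 1.2]
print(f"Pearson r = {pearson(overall_scores, human_prefs):.3f}")
```

A correlation near 1.0 means the automatic metric ranks models nearly the same way humans do, which is what makes the protocol a usable stand-in for the Human Turing Test.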
Related papers
- Rethinking Video Generation Model for the Embodied World [26.174517437895616]
RBench is designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. An evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. We introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation, with 4 million annotated video clips.
arXiv Detail & Related papers (2026-01-21T18:59:18Z)
- WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World [100.68103378427567]
Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. We further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent.
arXiv Detail & Related papers (2025-12-11T18:59:58Z)
- A Comprehensive Survey on World Models for Embodied AI [14.457261562275121]
Embodied AI requires agents that perceive, act, and anticipate how actions reshape future world states. This survey presents a unified framework for world models in embodied AI.
arXiv Detail & Related papers (2025-10-19T07:12:32Z)
- WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning [52.36434784963598]
We introduce WorldPrediction, a video-based benchmark for evaluating the world modeling and procedural planning capabilities of different AI models. We show that current frontier models barely achieve 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP, whereas humans are able to solve both tasks perfectly.
arXiv Detail & Related papers (2025-06-04T18:22:40Z)
- VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness [74.17234924159108]
We introduce VBench-2.0, a benchmark designed to evaluate video generative models for intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense. We conduct extensive human annotations to ensure evaluation alignment with human judgment.
arXiv Detail & Related papers (2025-03-27T17:57:01Z)
- WorldModelBench: Judging Video Generation Models As World Models [57.776769550453594]
Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving. Current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality. We propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains.
arXiv Detail & Related papers (2025-02-28T03:58:23Z)
- WorldSimBench: Towards Video Generation Models as World Simulators [79.69709361730865]
We classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench.
WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks.
Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.
arXiv Detail & Related papers (2024-10-23T17:56:11Z)
- Sapiens: Foundation for Human Vision Models [14.72839332332364]
We present Sapiens, a family of models for four fundamental human-centric vision tasks.
Our models support 1K high-resolution inference and are easy to adapt for individual tasks.
We observe that self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks.
arXiv Detail & Related papers (2024-08-22T17:37:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.