WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning
- URL: http://arxiv.org/abs/2506.04363v1
- Date: Wed, 04 Jun 2025 18:22:40 GMT
- Title: WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning
- Authors: Delong Chen, Willy Chung, Yejin Bang, Ziwei Ji, Pascale Fung
- Abstract summary: We introduce WorldPrediction, a video-based benchmark for evaluating world modeling and procedural planning capabilities of different AI models. We show that current frontier models barely achieve 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP, whereas humans are able to solve both tasks perfectly.
- Score: 52.36434784963598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans are known to have an internal "world model" that enables us to carry out action planning based on world states. AI agents need to have such a world model for action planning as well. It is not clear how current AI models, especially generative models, are able to learn such world models and carry out procedural planning in diverse environments. We introduce WorldPrediction, a video-based benchmark for evaluating world modeling and procedural planning capabilities of different AI models. In contrast to prior benchmarks that focus primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark that emphasizes actions with temporal and semantic abstraction. Given initial and final world states, the task is to distinguish the proper action (WorldPrediction-WM) or the properly ordered sequence of actions (WorldPrediction-PP) from a set of counterfactual distractors. This discriminative task setup enables us to evaluate different types of world models and planners and allows a thorough comparison across different hypotheses. The benchmark represents states and actions using visual observations. In order to prevent models from exploiting low-level continuity cues in background scenes, we provide "action equivalents" - identical actions observed in different contexts - as candidates for selection. This benchmark is grounded in a formal framework of partially observable semi-MDPs, ensuring better reliability and robustness of the evaluation. We conduct extensive human filtering and validation on our benchmark and show that current frontier models barely achieve 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP, whereas humans are able to solve both tasks perfectly.
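The discriminative setup described in the abstract lends itself to a simple evaluation harness: score every candidate action (or action sequence) conditioned on the initial and final states, select the highest-scoring candidate, and report top-1 accuracy. The sketch below illustrates this protocol under assumed names; `Sample`, `evaluate`, the `score` callback, and the field layout are hypothetical stand-ins and are not taken from the benchmark's actual release.

```python
# Minimal sketch of a WorldPrediction-style discriminative evaluation loop.
# All names here (Sample, evaluate, score) are hypothetical illustrations,
# not the benchmark's official data format or API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    initial_state: str      # observation of the initial world state (e.g., a frame or clip path)
    final_state: str        # observation of the final world state
    candidates: List[str]   # correct action clip plus counterfactual "action equivalents"
    answer: int             # index of the correct candidate

def evaluate(samples: List[Sample],
             score: Callable[[str, str, str], float]) -> float:
    """Top-1 accuracy of a model's candidate scores.

    score(initial, final, candidate) should return how plausible it is
    that `candidate` takes the world from the initial to the final state.
    """
    correct = 0
    for s in samples:
        scores = [score(s.initial_state, s.final_state, c) for c in s.candidates]
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == s.answer)
    return correct / len(samples)
```

Under this harness, chance level is 1/len(candidates), which gives a concrete reference point for the reported gap between frontier models (57% on WM, 38% on PP) and humans, who solve both tasks essentially perfectly.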
Related papers
- Evaluating Robot Policies in a World Model [54.874926065292904]
We investigate World-model-based Policy Evaluation (WPE).
WPE achieves high fidelity in mimicking robot arm movements as in real videos.
We show that WPE can serve as a starting point for evaluating robot policies before real-world deployment.
arXiv Detail & Related papers (2025-05-31T15:51:56Z) - RLVR-World: Training World Models with Reinforcement Learning [41.05792054442638]
We present RLVR-World, a unified framework that leverages reinforcement learning with verifiable rewards.
We demonstrate substantial performance gains on both language- and video-based world models across domains, including text games, web navigation, and robot manipulation.
arXiv Detail & Related papers (2025-05-20T05:02:53Z) - LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation [51.834607121538724]
We propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling.
We show that LaDi-WM significantly enhances policy performance by 27.9% on the LIBERO-LONG benchmark and 20% on the real-world scenario.
arXiv Detail & Related papers (2025-05-13T04:42:14Z) - AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability [84.52205243353761]
Recent work proposes using world models to generate controlled virtual environments in which AI agents can be tested before deployment.
We investigate ways of simplifying world models that remain agnostic to the AI agent under evaluation.
arXiv Detail & Related papers (2025-04-06T20:35:44Z) - Object-Centric World Model for Language-Guided Manipulation [4.008780119020479]
A world model is essential for an agent to predict the future and plan in domains such as autonomous driving and robotics.
We propose a world model leveraging object-centric representation space using slot attention, guided by language instructions.
Our model perceives the current state as an object-centric representation and predicts future states in this representation space conditioned on natural language instructions.
arXiv Detail & Related papers (2025-03-08T11:17:37Z) - WorldModelBench: Judging Video Generation Models As World Models [57.776769550453594]
Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving.
Current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality.
We propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains.
arXiv Detail & Related papers (2025-02-28T03:58:23Z) - Zero-shot Safety Prediction for Autonomous Robots with Foundation World Models [0.12499537119440243]
A world model creates a surrogate world to train a controller and predict safety violations by learning the internal dynamics of a system.
We propose foundation world models that embed observations into meaningful and causally latent representations.
This enables the surrogate dynamics to directly predict causal future states by leveraging a training-free large language model.
arXiv Detail & Related papers (2024-03-30T20:03:49Z) - EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z) - A Control-Centric Benchmark for Video Prediction [69.22614362800692]
We propose a control benchmark for evaluating action-conditioned video prediction.
Our benchmark includes simulated environments with 11 task categories and 310 task instance definitions.
We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling.
arXiv Detail & Related papers (2023-04-26T17:59:45Z)