WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World
- URL: http://arxiv.org/abs/2512.10958v1
- Date: Thu, 11 Dec 2025 18:59:58 GMT
- Title: WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World
- Authors: Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, Ziwei Liu
- Abstract summary: Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. We further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent.
- Score: 100.68103378427567
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.
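To make the five-axis structure concrete, here is a minimal sketch of how per-axis scores might be combined into a single fidelity number. The axis names follow the abstract; the equal-weight aggregation rule and the function names are illustrative assumptions, not the paper's actual scoring protocol.

```python
# Hypothetical sketch: combining WorldLens-style per-axis scores into one
# composite fidelity score. The weighted-mean rule is an assumption for
# illustration only.

AXES = ["generation", "reconstruction", "action_following",
        "downstream_task", "human_preference"]

def composite_score(scores, weights=None):
    """Weighted mean of per-axis scores in [0, 1]; equal weights by default."""
    if weights is None:
        weights = {a: 1.0 for a in AXES}
    total_w = sum(weights[a] for a in AXES)
    return sum(scores[a] * weights[a] for a in AXES) / total_w

# Example: a model with strong textures but weak physics/behavior,
# mirroring the trade-off the abstract describes.
model = {"generation": 0.9, "reconstruction": 0.6, "action_following": 0.4,
         "downstream_task": 0.5, "human_preference": 0.7}
print(round(composite_score(model), 3))  # 0.62
```

A single scalar like this hides exactly the trade-offs the benchmark is designed to expose, which is presumably why WorldLens reports the five aspects separately rather than collapsing them.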
Related papers
- WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models [114.95269118652163]
We introduce WorldArena, a unified benchmark designed to evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through three dimensions: video perception quality, measured with 16 metrics across six sub-dimensions; embodied task functionality, which evaluates world models as data engines, policy evaluators, and action planners; and subjective human evaluation. Through extensive experiments on 14 representative models, we reveal a significant perception-functionality gap, showing that high visual quality does not necessarily translate into strong embodied task capability.
arXiv Detail & Related papers (2026-02-09T18:09:20Z)
- Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks [43.59401259468559]
We argue that a robust world model should not be a loose collection of capabilities but a normative framework that integrally incorporates interaction, perception, symbolic reasoning, and spatial representation. This work aims to guide future research toward more general, robust, and principled models of the world.
arXiv Detail & Related papers (2026-02-02T04:42:44Z)
- WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models [17.757245394765807]
We introduce WorldBench, a video-based benchmark specifically designed for concept-specific, disentangled evaluation. WorldBench offers a more nuanced and scalable framework for rigorously evaluating the physical reasoning capabilities of video generation and world models.
arXiv Detail & Related papers (2026-01-29T05:31:02Z)
- Mirage2Matter: A Physically Grounded Gaussian World Model from Video [87.9732484393686]
We present Simulate Anything, a graphics-driven world modeling and simulation framework. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS). We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target.
arXiv Detail & Related papers (2026-01-24T07:43:57Z)
- From Generative Engines to Actionable Simulators: The Imperative of Physical Grounding in World Models [4.52033729546524]
A world model is an AI system that simulates how an environment evolves under actions. Current world models suffer from visual conflation: the mistaken assumption that high-fidelity video generation implies an understanding of physical and causal dynamics. We show that while modern models excel at predicting pixels, they frequently violate invariant constraints, fail under intervention, and break down in safety-critical decision-making.
arXiv Detail & Related papers (2026-01-21T23:35:33Z)
- DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving [49.11389494068169]
We present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset curated from both driving datasets and internet-scale video sources. General models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality.
arXiv Detail & Related papers (2026-01-04T13:36:21Z)
- 4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models [29.06964332825464]
World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. These models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. We introduce 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency.
arXiv Detail & Related papers (2025-11-25T02:05:35Z)
- A Step Toward World Models: A Survey on Robotic Manipulation [58.8419978790227]
We look at approaches that exhibit the core capabilities of world models through a review of methods in robotic manipulation. We analyze their roles across perception, prediction, and control, identify key challenges and solutions, and distill the core components, capabilities, and functions that a fully realized world model should possess.
arXiv Detail & Related papers (2025-10-31T00:57:24Z)
- Clone Deterministic 3D Worlds with Geometrically-Regularized World Models [16.494281967592745]
World models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. Despite rapid progress, current world models remain brittle and degrade over long horizons. We propose Geometrically-Regularized World Models (GRWM), which enforces that consecutive points along a natural sensory trajectory remain close in latent representation space.
arXiv Detail & Related papers (2025-10-30T17:56:43Z)
- A Comprehensive Survey on World Models for Embodied AI [14.457261562275121]
Embodied AI requires agents that perceive, act, and anticipate how actions reshape future world states. This survey presents a unified framework for world models in embodied AI.
arXiv Detail & Related papers (2025-10-19T07:12:32Z)
- AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability [84.52205243353761]
Recent work proposes using world models to generate controlled virtual environments in which AI agents can be tested before deployment. We investigate ways of simplifying world models that remain agnostic to the AI agent under evaluation.
arXiv Detail & Related papers (2025-04-06T20:35:44Z)
- WorldSimBench: Towards Video Generation Models as World Simulators [79.69709361730865]
We classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench.
WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks.
Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.
arXiv Detail & Related papers (2024-10-23T17:56:11Z)
- Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models [51.891804790725686]
Elements of World Knowledge (EWoK) is a framework for evaluating language models' understanding of the conceptual knowledge underlying world modeling. EWoK-core-1.0 is a dataset of 4,374 items covering 11 world knowledge domains. All tested models perform worse than humans, with results varying drastically across domains.
arXiv Detail & Related papers (2024-05-15T17:19:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.