4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models
- URL: http://arxiv.org/abs/2511.19836v1
- Date: Tue, 25 Nov 2025 02:05:35 GMT
- Title: 4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models
- Authors: Yiting Lu, Wei Luo, Peiyan Tu, Haoran Li, Hanxin Zhu, Zihao Yu, Xingrui Wang, Xinyi Chen, Xinge Peng, Xin Li, Zhibo Chen
- Abstract summary: World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. We introduce 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency.
- Score: 29.06964332825464
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, embodied intelligence, and content creation. However, prior benchmarks emphasize different evaluation dimensions and lack a unified assessment of world-realism capability. To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as Image-to-3D/4D, Video-to-4D, and Text-to-3D/4D. Beyond these, we innovatively introduce adaptive conditioning across multiple modalities, which not only integrates but also extends traditional evaluation paradigms. To accommodate different modality-conditioned inputs, we map all modality conditions into a unified textual space during evaluation, and further integrate LLM-as-judge, MLLM-as-judge, and traditional network-based methods. This unified and adaptive design enables more comprehensive and consistent evaluation of alignment, physical realism, and cross-modal coherence. Preliminary human studies further demonstrate that our adaptive tool selection achieves closer agreement with subjective human judgments. We hope this benchmark will serve as a foundation for objective comparisons and improvements, accelerating the transition from "visual generation" to "world generation." Our project can be found at https://yeppp27.github.io/4DWorldBench.github.io/.
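The abstract describes a two-step evaluation scheme: every conditioning input (image, video, or text) is first mapped into a unified textual space, and then an evaluator (LLM-as-judge, MLLM-as-judge, or a traditional network-based metric) is selected per dimension. A minimal sketch of that dispatch pattern, assuming hypothetical names and placeholder scorers (the real tools are not specified in the abstract), might look like:

```python
# Hypothetical sketch of the adaptive evaluation dispatch described above.
# All names (Sample, caption_condition, EVALUATORS) are illustrative, not the
# paper's actual API; the lambda scorers stand in for real judges/metrics.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Sample:
    condition: str     # conditioning input, already mapped into text
    output_path: str   # path to the generated 3D/4D asset

def caption_condition(condition: str, modality: str) -> str:
    """Map any conditioning modality into the unified textual space.
    A real system would call a captioning MLLM for image/video conditions;
    text conditions pass through unchanged."""
    if modality == "text":
        return condition
    return f"[auto-caption of {modality} condition] {condition}"

# Dimension -> evaluator. LLM/MLLM judges and network-based metrics are
# interchangeable callables returning a scalar score in [0, 1].
EVALUATORS: Dict[str, Callable[[Sample], float]] = {
    "perceptual_quality": lambda s: 0.5,   # e.g. a no-reference IQA network
    "condition_alignment": lambda s: 0.5,  # e.g. an LLM-as-judge on captions
    "physical_realism": lambda s: 0.5,     # e.g. an MLLM-as-judge on renders
    "4d_consistency": lambda s: 0.5,       # e.g. a temporal-consistency metric
}

def evaluate(sample: Sample) -> Dict[str, float]:
    """Score one generated world along all four benchmark dimensions."""
    return {dim: fn(sample) for dim, fn in EVALUATORS.items()}

if __name__ == "__main__":
    s = Sample(condition=caption_condition("a red car driving in rain", "text"),
               output_path="scene.ply")
    print(evaluate(s))
```

The design choice worth noting is that once all conditions live in a shared textual space, a single judge interface can compare text-, image-, and video-conditioned generations uniformly.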
Related papers
- Beyond Pixel Histories: World Models with Persistent 3D State [50.4601060508243]
PERSIST is a new paradigm of world model that simulates the evolution of a latent 3D scene. We show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods.
arXiv Detail & Related papers (2026-03-03T19:58:31Z) - VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control [83.92729346325163]
VerseCrafter is a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos.
arXiv Detail & Related papers (2026-01-08T17:28:52Z) - WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World [100.68103378427567]
Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. We further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent.
arXiv Detail & Related papers (2025-12-11T18:59:58Z) - DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling [67.95038177144554]
We introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic captions. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos.
arXiv Detail & Related papers (2025-12-02T18:24:27Z) - Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation [61.60600246983274]
Existing 3D and 4D approaches typically embed scene geometry into an autoregressive model for semantic understanding and a diffusion model for content generation. We propose Uni4D-LLM, the first unified VLM framework with spatiotemporal awareness for 4D scene understanding and generation.
arXiv Detail & Related papers (2025-09-28T12:06:54Z) - OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling [86.12242953301121]
We introduce OmniWorld, a large-scale, multi-domain, multi-modal dataset specifically designed for 4D world modeling. Compared with existing synthetic datasets, OmniWorld-Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. We establish a challenging benchmark that exposes the limitations of current state-of-the-art (SOTA) approaches in modeling complex 4D environments.
arXiv Detail & Related papers (2025-09-15T17:59:19Z) - Advances in 4D Generation: A Survey [23.041037534410773]
4D generation enables richer interactive and immersive experiences. Despite rapid progress, the field lacks a unified understanding of 4D representations, generative frameworks, basic paradigms, and the core technical challenges it faces. This survey provides a systematic and in-depth review of the 4D generation landscape.
arXiv Detail & Related papers (2025-03-18T17:59:51Z) - Simulating the Real World: A Unified Survey of Multimodal Generative Models [48.35284571052435]
We present a unified survey of multimodal generative models that investigates the progression of data dimensionality in real-world simulation. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D, and 4D generation within a single framework.
arXiv Detail & Related papers (2025-03-06T17:31:43Z) - 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives [115.67081491747943]
Dynamic 3D scene representation and novel view synthesis are crucial for enabling AR/VR and metaverse applications. We reformulate the reconstruction of a time-varying 3D scene as approximating its underlying 4D volume. We derive several compact variants that effectively reduce the memory footprint to address its storage bottleneck.
arXiv Detail & Related papers (2024-12-30T05:30:26Z) - Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer [38.85054820740242]
We present a novel approach for generating high-quality, coherent human videos from a single image.
Our framework combines the strengths of diffusion transformers for capturing global correlations and CNNs for accurate condition injection.
We demonstrate our method's ability to synthesize 360-degree realistic, coherent human motion videos.
arXiv Detail & Related papers (2024-05-27T17:53:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.