VideoVerse: How Far is Your T2V Generator from a World Model?
- URL: http://arxiv.org/abs/2510.08398v2
- Date: Tue, 21 Oct 2025 16:28:13 GMT
- Title: VideoVerse: How Far is Your T2V Generator from a World Model?
- Authors: Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, Lei Zhang
- Abstract summary: VideoVerse is a benchmark that focuses on evaluating whether a T2V model can understand complex temporal causality and real-world knowledge. VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. We perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent rapid advancement of Text-to-Video (T2V) generation technologies, which are critical to building "world models", has made existing benchmarks increasingly insufficient for evaluating state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, can no longer differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which is an essential capability for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that focuses on evaluating whether a T2V model can understand complex temporal causality and real-world knowledge. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions covering both dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. We then develop a human-preference-aligned, QA-based evaluation pipeline built on modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing an in-depth analysis of how far current T2V generators are from world models.
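The QA-based protocol described in the abstract maps naturally onto a small evaluation harness. The following is a minimal, hypothetical Python sketch of such a loop; the `ask_vlm` judge and the question fields are illustrative stand-ins for a modern vision-language model, not the authors' released pipeline.

```python
# Minimal sketch of a VideoVerse-style binary-QA evaluation loop.
# Assumption: ask_vlm() is a stand-in for querying a vision-language
# model with a generated video and a yes/no question.
from dataclasses import dataclass


@dataclass
class BinaryQuestion:
    dimension: str  # one of the ten evaluation dimensions
    text: str       # e.g., "Does the glass break AFTER the ball hits it?"


def ask_vlm(video_path: str, question: str) -> bool:
    """Placeholder VLM judge; replace with a real model call."""
    raise NotImplementedError("plug in a vision-language model here")


def evaluate(video_path: str, questions: list[BinaryQuestion]) -> dict[str, float]:
    """Aggregate yes/no answers into a per-dimension pass rate."""
    passed: dict[str, int] = {}
    total: dict[str, int] = {}
    for q in questions:
        total[q.dimension] = total.get(q.dimension, 0) + 1
        if ask_vlm(video_path, q.text):
            passed[q.dimension] = passed.get(q.dimension, 0) + 1
    return {dim: passed.get(dim, 0) / total[dim] for dim in total}
```

Keeping scores per dimension, rather than collapsing them into a single number, preserves the paper's goal of exposing where a model fails (e.g., temporal causality vs. static attributes).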
Related papers
- RISE-Video: Can Video Generators Decode Implicit World Rules? [71.92434352963427]
We present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories. We propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment.
arXiv Detail & Related papers (2026-02-05T18:36:10Z) - UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark [35.157850129371525]
Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. Existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency. We propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning.
arXiv Detail & Related papers (2025-09-29T08:14:26Z) - T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation [12.843117062583502]
We propose T2VWorldBench, the first systematic evaluation framework for evaluating the world knowledge generation abilities of text-to-video models. To address both human preference and scalable evaluation, our benchmark incorporates both human evaluation and automated evaluation using vision-language models (VLMs). We evaluated the 10 most advanced text-to-video models currently available, ranging from open-source to commercial models, and found that most models are unable to understand world knowledge and generate truly correct videos.
arXiv Detail & Related papers (2025-07-24T05:37:08Z) - Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models [14.187604603759784]
We present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of text-to-video systems. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline. PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
arXiv Detail & Related papers (2025-07-21T17:30:46Z) - T2VEval: Benchmark Dataset and Objective Evaluation Method for T2V-generated Videos [9.742383920787413]
T2VEval is a multi-branch fusion scheme for text-to-video quality evaluation. It assesses videos across three branches: text-video consistency, realness, and technical quality. T2VEval achieves state-of-the-art performance across multiple metrics.
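As a rough illustration of what a multi-branch scheme computes at the final step, here is a minimal late-fusion sketch; the fixed weights are hypothetical placeholders, whereas T2VEval itself learns its branches and fusion from data.

```python
# Hypothetical sketch of late fusion over three quality branches.
# Assumption: branch scores lie in [0, 1]; the weights are illustrative
# placeholders, not T2VEval's learned fusion.

def fuse_scores(consistency: float, realness: float, technical: float,
                weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted late fusion of per-branch quality scores."""
    w1, w2, w3 = weights
    return w1 * consistency + w2 * realness + w3 * technical

# Example: a video with high text-video consistency but mediocre realness.
print(fuse_scores(consistency=0.9, realness=0.5, technical=0.7))  # -> 0.72
```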
arXiv Detail & Related papers (2025-01-15T03:11:33Z) - VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models [111.5892290894904]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
VBench++ supports evaluating text-to-video and image-to-video.
arXiv Detail & Related papers (2024-11-20T17:54:41Z) - Evaluation of Text-to-Video Generation Models: A Dynamics Perspective [94.2662603491163]
Existing evaluation protocols primarily focus on temporal consistency and content continuity.
We propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models.
arXiv Detail & Related papers (2024-07-01T08:51:22Z) - ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation [57.651809298512276]
ChronoMagic-Bench is a text-to-video (T2V) generation benchmark.
It focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence.
We conduct manual evaluations of ten representative T2V models, revealing their strengths and weaknesses.
We create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos.
arXiv Detail & Related papers (2024-06-26T17:50:47Z) - PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models [50.33699462106502]
Text-to-image (T2I) models frequently fail to produce images consistent with physical commonsense.
Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal knowledge.
We introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties.
arXiv Detail & Related papers (2024-06-17T17:49:01Z) - Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z) - VBench: Comprehensive Benchmark Suite for Video Generative Models [100.43756570261384]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations.
arXiv Detail & Related papers (2023-11-29T18:39:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.