Related papers: RISE-Video: Can Video Generators Decode Implicit World Rules?

URL: http://arxiv.org/abs/2602.05986v1
Date: Thu, 05 Feb 2026 18:36:10 GMT
Title: RISE-Video: Can Video Generators Decode Implicit World Rules?
Authors: Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang
Abstract summary: We present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories. We propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment.
Score: 71.92434352963427
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.

Related papers

MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation [48.84450712826316]
MSVBench is the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation. We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models with the fine-grained perceptual rigor of domain-specific expert models.
arXiv Detail & Related papers (2026-02-27T12:26:34Z)
Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights [41.385614383367205]
Current models aim to transcend the limitations of single-modality representations by unifying understanding and generation. Their reliance on static single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion. We propose Envision, a causal event progression benchmark for chained text-to-multi-image generation.
arXiv Detail & Related papers (2025-12-01T15:52:31Z)
AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes [63.055387623861094]
Recent Text-to-Video (T2V) models have demonstrated powerful capability in visual simulation of real-world geometry and physical laws. We propose a two-stage paradigm to adapt pre-trained T2V models for viewpoint prediction.
arXiv Detail & Related papers (2025-10-12T15:55:44Z)
VideoVerse: How Far is Your T2V Generator from a World Model? [25.155601280571577]
VideoVerse is a benchmark that focuses on evaluating whether a T2V model could understand complex temporal causality and world knowledge in the real world. VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. We perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse.
arXiv Detail & Related papers (2025-10-09T16:18:20Z)
BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration [56.98981194478512]
We propose a unified framework that handles a broad range of subject-to-video scenarios. We introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos.
arXiv Detail & Related papers (2025-10-01T02:41:11Z)
UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark [35.157850129371525]
Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. Existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency. We propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning.
arXiv Detail & Related papers (2025-09-29T08:14:26Z)
AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences [41.66718802220536]
AEGIS comprises over 10,000 rigorously curated real and synthetic videos generated by diverse, state-of-the-art generative models. We provide multimodal annotations spanning Semantic-Authenticity Descriptions, Motion Features, and Low-level Visual Features. Experiments using advanced vision-language models demonstrate limited detection capabilities on the most challenging subsets of AEGIS.
arXiv Detail & Related papers (2025-08-14T15:55:49Z)
T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation [12.843117062583502]
We propose T2VWorldBench, the first systematic evaluation framework for evaluating the world knowledge generation abilities of text-to-video models. To address both human preference and scalable evaluation, our benchmark incorporates both human evaluation and automated evaluation using vision-language models (VLMs). We evaluated the 10 most advanced text-to-video models currently available, ranging from open source to commercial models, and found that most models are unable to understand world knowledge and generate truly correct videos.
arXiv Detail & Related papers (2025-07-24T05:37:08Z)
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning [54.47710436807661]
MORSE-500 is a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is generated using deterministic Python scripts (Manim, Matplotlib, MoviePy), generative video models, and real footage. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve.
arXiv Detail & Related papers (2025-06-05T19:12:45Z)
MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement [47.064467920954776]
We introduce MAGREF, a unified and effective framework for any-reference video generation. Our approach incorporates masked guidance and a subject disentanglement mechanism. Experiments on a comprehensive benchmark demonstrate that MAGREF consistently outperforms existing state-of-the-art approaches.
arXiv Detail & Related papers (2025-05-29T17:58:15Z)
VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an automated framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.