RISE-Video: Can Video Generators Decode Implicit World Rules?
- URL: http://arxiv.org/abs/2602.05986v1
- Date: Thu, 05 Feb 2026 18:36:10 GMT
- Title: RISE-Video: Can Video Generators Decode Implicit World Rules?
- Authors: Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang,
- Abstract summary: We present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories. We propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment.
- Score: 71.92434352963427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.
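The four-metric protocol described in the abstract could be tabulated per category along the following lines. This is only a minimal sketch, not the authors' pipeline: the `SampleScore` type, the `aggregate` function, the snake_case metric keys, the category labels, and the [0, 1] score scale are all hypothetical, since the abstract does not specify how scores are represented or combined.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The four metric names come from the abstract; the snake_case keys are an assumption.
METRICS = (
    "reasoning_alignment",
    "temporal_consistency",
    "physical_rationality",
    "visual_quality",
)


@dataclass
class SampleScore:
    """Hypothetical record of one generated video's scores."""
    category: str             # one of the benchmark's eight categories (label is illustrative)
    scores: dict[str, float]  # metric name -> score, assumed to lie in [0, 1]


def aggregate(samples: list[SampleScore]) -> dict[str, dict[str, float]]:
    """Average each metric within each category (simple unweighted mean, an assumption)."""
    buckets: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))
    for sample in samples:
        missing = set(METRICS) - set(sample.scores)
        if missing:
            raise ValueError(f"sample is missing metrics: {sorted(missing)}")
        for metric in METRICS:
            buckets[sample.category][metric].append(sample.scores[metric])
    return {
        category: {metric: mean(values) for metric, values in per_metric.items()}
        for category, per_metric in buckets.items()
    }


if __name__ == "__main__":
    # Made-up scores purely to exercise the function; not results from the paper.
    demo = [
        SampleScore("commonsense", dict.fromkeys(METRICS, 0.5)),
        SampleScore("commonsense", dict.fromkeys(METRICS, 1.0)),
    ]
    print(aggregate(demo))
```

Keeping the four metrics separate, rather than collapsing them into one number, mirrors the abstract's framing of the protocol as multi-dimensional; any weighting across metrics would be a further assumption.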
Related papers
- MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation [48.84450712826316]
MSVBench is the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation. We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models with the fine-grained perceptual rigor of domain-specific expert models.
arXiv Detail & Related papers (2026-02-27T12:26:34Z)
- Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights [41.385614383367205]
Current models aim to transcend the limitations of single-modality representations by unifying understanding and generation. Their reliance on static single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion. We propose Envision, a causal event progression benchmark for chained text-to-multi-image generation.
arXiv Detail & Related papers (2025-12-01T15:52:31Z)
- AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes [63.055387623861094]
Recent Text-to-Video (T2V) models have demonstrated powerful capability in visual simulation of real-world geometry and physical laws. We propose a two-stage paradigm to adapt pre-trained T2V models for viewpoint prediction.
arXiv Detail & Related papers (2025-10-12T15:55:44Z)
- VideoVerse: How Far is Your T2V Generator from a World Model? [25.155601280571577]
VideoVerse is a benchmark that focuses on evaluating whether a T2V model could understand complex temporal causality and world knowledge in the real world. VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. We perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse.
arXiv Detail & Related papers (2025-10-09T16:18:20Z)
- BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration [56.98981194478512]
We propose a unified framework that handles a broad range of subject-to-video scenarios. We introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos.
arXiv Detail & Related papers (2025-10-01T02:41:11Z)
- UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark [35.157850129371525]
Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. Existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency. We propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning.
arXiv Detail & Related papers (2025-09-29T08:14:26Z)
- AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences [41.66718802220536]
AEGIS comprises over 10,000 rigorously curated real and synthetic videos generated by diverse, state-of-the-art generative models. We provide multimodal annotations spanning Semantic-Authenticity Descriptions, Motion Features, and Low-level Visual Features. Experiments using advanced vision-language models demonstrate limited detection capabilities on the most challenging subsets of AEGIS.
arXiv Detail & Related papers (2025-08-14T15:55:49Z)
- T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation [12.843117062583502]
We propose T2VWorldBench, the first systematic evaluation framework for evaluating the world knowledge generation abilities of text-to-video models. To address both human preference and scalable evaluation, our benchmark incorporates both human evaluation and automated evaluation using vision-language models (VLMs). We evaluated the 10 most advanced text-to-video models currently available, ranging from open source to commercial models, and found that most models are unable to understand world knowledge and generate truly correct videos.
arXiv Detail & Related papers (2025-07-24T05:37:08Z)
- MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning [54.47710436807661]
MORSE-500 is a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is generated using deterministic Python scripts (Manim, Matplotlib, MoviePy), generative video models, and real footage. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve.
arXiv Detail & Related papers (2025-06-05T19:12:45Z)
- MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement [47.064467920954776]
We introduce MAGREF, a unified and effective framework for any-reference video generation. Our approach incorporates masked guidance and a subject disentanglement mechanism. Experiments on a comprehensive benchmark demonstrate that MAGREF consistently outperforms existing state-of-the-art approaches.
arXiv Detail & Related papers (2025-05-29T17:58:15Z)
- VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an automated framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.