Related papers: MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation

MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation

URL: http://arxiv.org/abs/2602.23969v1
Date: Fri, 27 Feb 2026 12:26:34 GMT
Title: MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation
Authors: Haoyuan Shi, Yunxin Li, Nanhao Deng, Zhenran Xu, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang,
Abstract summary: MSVBench is the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation.<n>We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models with the fine-grained perceptual rigor of domain-specific expert models.
Score: 48.84450712826316
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The evolution of video generation toward complex, multi-shot narratives has exposed a critical deficit in current evaluation methods. Existing benchmarks remain anchored to single-shot paradigms, lacking the comprehensive story assets and cross-shot metrics required to assess long-form coherence and appeal. To bridge this gap, we introduce MSVBench, the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation. We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models (LMMs) with the fine-grained perceptual rigor of domain-specific expert models. Evaluating 20 video generation methods across diverse paradigms, we find that current models--despite strong visual fidelity--primarily behave as visual interpolators rather than true world models. We further validate the reliability of our benchmark by demonstrating a state-of-the-art Spearman's rank correlation of 94.4% with human judgments. Finally, MSVBench extends beyond evaluation by providing a scalable supervisory signal. Fine-tuning a lightweight model on its pipeline-refined reasoning traces yields human-aligned performance comparable to commercial models like Gemini-2.5-Flash.

Related papers

GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks? [29.804627410258732]
We introduce a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation.<n>Our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard.
arXiv Detail & Related papers (2026-02-05T18:52:48Z)
RISE-Video: Can Video Generators Decode Implicit World Rules? [71.92434352963427]
We present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis.<n>RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories.<n>We propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment.
arXiv Detail & Related papers (2026-02-05T18:36:10Z)
VIPER: Process-aware Evaluation for Generative Video Reasoning [64.86465792516658]
We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning.<n>Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit a significant outcome-hacking.
arXiv Detail & Related papers (2025-12-31T16:31:59Z)
HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding [120.84817886550765]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos.<n>Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios.<n>We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z)
ConceptMix++: Leveling the Playing Field in Text-to-Image Benchmarking via Iterative Prompt Optimization [20.935028961216325]
ConceptMix++ is a framework that disentangles prompt phrasing from visual generation capabilities.<n>We show that optimized prompts significantly improve compositional generation performance.<n>These findings demonstrate that rigid benchmarking approaches may significantly underrepresent true model capabilities.
arXiv Detail & Related papers (2025-07-04T03:27:04Z)
VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM) VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities. Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation. Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models. MMBench is meticulously curated with well-designed quality control schemes. MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
Unsupervised Video Summarization via Multi-source Features [4.387757291346397]
Video summarization aims at generating a compact yet representative visual summary that conveys the essence of the original video. We propose the incorporation of multiple feature sources with chunk and stride fusion to provide more information about the visual content. For a comprehensive evaluation on the two benchmarks TVSum and SumMe, we compare our method with four state-of-the-art approaches.
arXiv Detail & Related papers (2021-05-26T13:12:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.