Related papers: VIPER: Process-aware Evaluation for Generative Video Reasoning

VIPER: Process-aware Evaluation for Generative Video Reasoning

URL: http://arxiv.org/abs/2512.24952v1
Date: Wed, 31 Dec 2025 16:31:59 GMT
Title: VIPER: Process-aware Evaluation for Generative Video Reasoning
Authors: Yifan Li, Yukai Gu, Yingqian Min, Zikang Liu, Yifan Du, Kun Zhou, Min Yang, Wayne Xin Zhao, Minghui Qiu,
Abstract summary: We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning.<n>Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit a significant outcome-hacking.
Score: 64.86465792516658
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit a significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.

Related papers

MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation [48.84450712826316]
MSVBench is the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation.<n>We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models with the fine-grained perceptual rigor of domain-specific expert models.
arXiv Detail & Related papers (2026-02-27T12:26:34Z)
CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation [52.0601996237501]
Chain-of-Frame (CoF) reasoning enables frame-by-frame visual inference.<n>CoF-T2I integrates CoF reasoning into text-to-image (T2I) generation via progressive visual refinement.<n>Experiments show that CoF-T2I significantly outperforms the base video model.
arXiv Detail & Related papers (2026-01-15T04:33:06Z)
Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model [18.49759441592143]
We introduce REACT, a frame-level reward model designed specifically for structural distortions evaluation in generative videos.<n>ReACT assigns point-wise scores and attribution labels by reasoning over video frames, focusing on recognizing distortions.<n>We also present REACT-Bench, a benchmark for generative video distortion evaluation.
arXiv Detail & Related papers (2026-01-07T15:47:14Z)
Q-Save: Towards Scoring and Attribution for Generated Video Evaluation [65.83319736145869]
We present Q-Save, a new benchmark dataset and model for holistic evaluation of AI-generated video (AIGV) quality.<n>The dataset contains near 10000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels.<n>We propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation.
arXiv Detail & Related papers (2025-11-24T07:00:21Z)
Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum [36.360760591731484]
We introduce a framework built on the co-design of evaluation, data, and modeling.<n>First, we establish the Universal Video Retrieval Benchmark (UVRB)<n>Second, guided by UVRB's diagnostics, we introduce a scalable workflow that generates 1.55 million high-quality pairs.<n>Third, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE)
arXiv Detail & Related papers (2025-10-31T15:54:48Z)
ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [51.297873393639456]
ArtifactsBench is a framework for automated visual code generation evaluation.<n>Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots.<n>We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z)
SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution [46.311223206965934]
We study key design principles for latter cascaded video super-resolution models, which are underexplored currently.<n>First, we propose two strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator.<n>Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies, (2) noise augmentation effects on low-resolution (LR) inputs.
arXiv Detail & Related papers (2025-06-24T17:57:26Z)
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities. Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation. Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.