Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
- URL: http://arxiv.org/abs/2510.26802v1
- Date: Thu, 30 Oct 2025 17:59:55 GMT
- Title: Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
- Authors: Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng
- Abstract summary: We conduct an empirical study to investigate whether video models are ready to serve as zero-shot reasoners. We focus on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic.
- Score: 124.00111584020834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io
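To make the evaluation setup concrete, here is a minimal, hypothetical Python sketch of how per-dimension Chain-of-Frame scoring over an MME-CoF-style dataset might be organized. The data schema, dimension labels, and the stub scorer are illustrative assumptions, not the authors' released benchmark code.

```python
# Hypothetical sketch: aggregate judge scores per reasoning dimension
# for videos generated by the model under test. All names are placeholders.
from dataclasses import dataclass
from collections import defaultdict


@dataclass
class CoFSample:
    prompt: str      # text/image prompt given to the video model
    dimension: str   # e.g. "spatial", "geometric", "physical", "temporal", "embodied"
    expected: str    # reference description of the correct reasoning outcome


def score_generated_video(sample: CoFSample, frames: list) -> float:
    """Placeholder scorer: in practice a human or model-based judge would check
    whether the generated frames realize the expected reasoning outcome."""
    return 0.0  # stub


def evaluate(samples: list[CoFSample], generate_video) -> dict[str, float]:
    """Average judge scores per reasoning dimension."""
    totals, counts = defaultdict(float), defaultdict(int)
    for s in samples:
        frames = generate_video(s.prompt)  # call the video model under test
        totals[s.dimension] += score_generated_video(s, frames)
        counts[s.dimension] += 1
    return {d: totals[d] / counts[d] for d in totals}
```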
Related papers
- A Mechanistic View on Video Generation as World Models: State and Dynamics [43.951972667861575]
This work proposes a novel taxonomy centered on two pillars: State Construction and Dynamics Modeling. By addressing these challenges, the field can evolve from generating visually plausible videos to building robust, general-purpose world simulators.
arXiv Detail & Related papers (2026-01-22T19:00:18Z) - MMGR: Multi-Modal Generative Reasoning [97.44203203196481]
We introduce MMGR, a principled evaluation framework based on five reasoning abilities. MMGR evaluates generative reasoning across three domains: Abstract Reasoning, Embodied Navigation, and Physical Commonsense. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image).
arXiv Detail & Related papers (2025-12-16T18:58:04Z) - Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning [18.15310805625469]
We present Know-Show, a new benchmark designed to evaluate multimodal Video-Language Models (Video-LMs). Know-Show unifies reasoning and localization within a single evaluation framework consisting of five scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K human-language questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounding.
arXiv Detail & Related papers (2025-12-05T08:15:49Z) - Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for large language models. We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z) - V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models [52.97290143922252]
V-ReasonBench is a benchmark designed to assess video reasoning across four key dimensions. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning.
arXiv Detail & Related papers (2025-11-20T18:59:42Z) - TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models [42.763907973320464]
TiViBench is a hierarchical benchmark designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. We introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models.
arXiv Detail & Related papers (2025-11-17T18:52:44Z) - When Thinking Drifts: Evidential Grounding for Robust Video Reasoning [68.75730050161219]
The Chain-of-Thought (CoT) mechanism has enhanced reasoning in text-based tasks, but CoT often degrades performance in video reasoning, generating verbose but misleading internal monologues. Visual Evidence Reward (VER) is a reinforcement learning framework that explicitly rewards the generation of reasoning traces that are verifiably grounded in visual evidence.
arXiv Detail & Related papers (2025-10-07T16:03:33Z) - ImplicitQA: Going beyond frames towards Implicit Video Reasoning [39.63171940350552]
ImplicitQA is a novel benchmark designed to test VideoQA models on human-like implicit reasoning. ImplicitQA comprises 1K meticulously annotated QA pairs drawn from 1K high-quality creative video clips.
arXiv Detail & Related papers (2025-06-26T19:53:54Z) - Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs [18.07249962240035]
Video large language models (Video-LLMs) are increasingly integrated into real-world applications that demand grounded multimodal reasoning. Sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. We propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs.
arXiv Detail & Related papers (2025-06-08T15:00:21Z) - Causality Model for Semantic Understanding on Videos [0.0]
This thesis focuses on the domain of semantic video understanding. It explores the potential of causal modeling to advance two fundamental tasks: Video Relation Detection (VidVRD) and Video Question Answering (VideoQA).
arXiv Detail & Related papers (2025-03-16T10:44:11Z) - V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning [40.18308199837137]
We introduce the Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. We construct a dataset to elicit the spatio-temporal reasoning process of Video-LLMs. Experiments on 14 Video-LLMs reveal significant gaps between current Video-LLMs and the need for robust and consistent reasoning.
arXiv Detail & Related papers (2025-03-14T15:21:44Z) - VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an automated framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z) - Motion Dreamer: Boundary Conditional Motion Reasoning for Physically Coherent Video Generation [27.690736225683825]
We introduce Motion Dreamer, a two-stage framework that explicitly separates motion reasoning from visual synthesis. Our approach introduces instance flow, a sparse-to-dense motion representation enabling effective integration of partial user-defined motions. Experiments demonstrate that Motion Dreamer significantly outperforms existing methods, achieving superior motion plausibility and visual realism.
arXiv Detail & Related papers (2024-11-30T17:40:49Z) - TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models [55.48403691519395]
TOMATO is a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks. Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model.
arXiv Detail & Related papers (2024-10-30T17:50:23Z) - STAR: A Benchmark for Situated Reasoning in Real-World Videos [94.78038233351758]
This paper introduces a new benchmark that evaluates the situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos.
The dataset includes four types of questions, including interaction, sequence, prediction, and feasibility.
We propose a diagnostic neuro-symbolic model that can disentangle visual perception, situation abstraction, language understanding, and functional reasoning.
arXiv Detail & Related papers (2024-05-15T21:53:54Z)