TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
- URL: http://arxiv.org/abs/2511.13704v1
- Date: Mon, 17 Nov 2025 18:52:44 GMT
- Title: TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
- Authors: Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui Chen, Wenkai Cheng, Kanghao Chen, Hongfei Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen
- Abstract summary: TiViBench is a hierarchical benchmark designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. We introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models.
- Score: 42.763907973320464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.
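To make the test-time recipe concrete, the sketch below illustrates one plausible reading of VideoTPO as described in the abstract: sample several candidate videos, have an LLM analyse each candidate's strengths and weaknesses, and use that analysis to prefer one output without any additional training, data, or reward model. All names (`video_tpo`, `generate`, `critique`, `prefer`) are hypothetical placeholders rather than the paper's API, and the final selection step is an assumption, since the abstract does not specify how the self-analysis is consumed.

```python
# Hypothetical sketch of a VideoTPO-style test-time loop (not the paper's code).
# `generate`, `critique`, and `prefer` stand in for an I2V model and an LLM judge.
from typing import Callable, List, Tuple

def video_tpo(
    prompt: str,
    image: bytes,
    generate: Callable[[str, bytes], bytes],   # (prompt, image) -> candidate video
    critique: Callable[[str, bytes], str],     # (prompt, video) -> strengths/weaknesses
    prefer: Callable[[str, List[str]], int],   # (prompt, analyses) -> preferred index
    n_candidates: int = 2,
) -> Tuple[bytes, str]:
    """Training-free refinement: no extra data, fine-tuning, or reward model."""
    # 1) Sample several candidate videos from the image-to-video model.
    candidates: List[bytes] = [generate(prompt, image) for _ in range(n_candidates)]

    # 2) LLM self-analysis: an explicit strengths/weaknesses critique per candidate,
    #    in the spirit of preference optimization (comparisons, not scalar rewards).
    analyses: List[str] = [critique(prompt, v) for v in candidates]

    # 3) Consume the analysis. Assumed here: the LLM picks the preferred candidate;
    #    guided regeneration would be an equally plausible alternative.
    best = prefer(prompt, analyses)
    return candidates[best], analyses[best]
```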
Related papers
- Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning [6.800544911407401]
GRiP (Guided Reasoning and Perception) is a novel training framework for visual grounded reasoning. GRiP cultivates robust and flexible visual grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways. GRiP achieves state-of-the-art results among open-source models on the highly challenging TreeBench and V* Bench.
arXiv Detail & Related papers (2025-11-27T07:18:25Z) - V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models [52.97290143922252]
V-ReasonBench is a benchmark designed to assess video reasoning across four key dimensions. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning.
arXiv Detail & Related papers (2025-11-20T18:59:42Z) - Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark [124.00111584020834]
We conduct an empirical study to investigate whether video models are ready to serve as zero-shot reasoners. We focus on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic.
arXiv Detail & Related papers (2025-10-30T17:59:55Z) - More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models [17.431298099935344]
Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Recent research has sought to extend reasoning to Vision-Language Models (VLMs). Our study uncovers the dual nature of multimodal reasoning: extended reasoning can cause recognition failures on otherwise basic visual questions. We propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories.
arXiv Detail & Related papers (2025-09-30T06:37:47Z) - Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning process. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles [29.214813685163218]
The release of OpenAI's o-[n] series, such as o1, o3, and o4-mini, marks a significant paradigm shift in Large Language Models. We track the evolution of the GPT-[n] and o-[n] series models on challenging multimodal puzzles. Our results reveal that the o-[n] series, particularly later iterations like o3 and o4-mini, significantly outperforms the GPT-[n] series and shows strong scalability in multimodal reasoning.
arXiv Detail & Related papers (2025-02-03T05:47:04Z) - STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance on basic video understanding tasks. However, Video-LLMs struggle with compositional reasoning that requires multi-step explicit spatio-temporal inference across object relations, interactions, and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z) - ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark several state-of-the-art language-only and multimodal models on our dataset, and the experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)