Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
- URL: http://arxiv.org/abs/2511.04570v1
- Date: Thu, 06 Nov 2025 17:25:23 GMT
- Title: Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
- Authors: Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu
- Abstract summary: The "Thinking with Video" paradigm bridges visual and textual reasoning in a unified temporal framework. Sora-2 is established as a capable reasoner on vision-centric tasks. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% accuracy on MMMU.
- Score: 73.4888880112019
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: "Thinking with Text" and "Thinking with Images" paradigm significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, hindering unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that the video generation model is the potential unified multimodal understanding and generation model, positions "thinking with video" as a unified multimodal reasoning paradigm.
Related papers
- MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models [29.7077721906364]
We introduce MathReal, a dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. MathReal spans five core knowledge and ability categories, which encompass three question types and are divided into three difficulty levels. We evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios.
arXiv Detail & Related papers (2025-08-08T04:39:16Z) - MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind [41.188841829937466]
MoMentS (Multimodal Mental States) is a benchmark for building socially intelligent multimodal agents. MoMentS includes over 2,300 multiple-choice questions spanning seven distinct ToM categories. We evaluate several MLLMs and find that although vision generally improves performance, models still struggle to integrate it effectively.
arXiv Detail & Related papers (2025-07-06T15:06:30Z) - All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark [70.92907745196153]
MAIA is a benchmark for fine-grained investigation of the reasoning abilities of visual language models on videos. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. MAIA differs from other available video benchmarks in its design, its reasoning categories, the metric it uses, and the language and culture of the videos.
arXiv Detail & Related papers (2025-02-24T09:25:51Z) - Forgotten Polygons: Multimodal Large Language Models are Shape-Blind [55.65083505741497]
Despite strong performance on vision-language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem-solving. Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons. We propose Visually Cued Chain-of-Thought prompting, which enhances multi-step mathematical reasoning by explicitly referencing visual annotations in diagrams.
arXiv Detail & Related papers (2025-02-21T22:04:09Z) - TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models [55.48403691519395]
TOMATO is a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks. Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model.
arXiv Detail & Related papers (2024-10-30T17:50:23Z) - MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z) - Let's Think Frame by Frame with VIP: A Video Infilling and Prediction
Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of keyframes.
We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought.
We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z) - MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)