Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
- URL: http://arxiv.org/abs/2510.23473v1
- Date: Mon, 27 Oct 2025 16:10:45 GMT
- Title: Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
- Authors: Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Runhao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, Xuelian Cheng,
- Abstract summary: Video-Thinker enables MLLMs to autonomously navigate grounding and captioning tasks for video reasoning. Our Video-Thinker-7B substantially outperforms existing baselines such as Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.
- Score: 20.07360876062324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic "grounding" and "captioning" capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously navigate grounding and captioning tasks for video reasoning, eliminating the need for constructing and calling external tools. Extensive experiments demonstrate that Video-Thinker achieves significant performance gains on both in-domain tasks and challenging out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Our Video-Thinker-7B substantially outperforms existing baselines such as Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.
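To make the second training stage concrete, the snippet below sketches the group-relative advantage computation that distinguishes GRPO from PPO-style methods: for each prompt, the policy samples a group of rollouts, and each rollout's reward is normalized against the group's own statistics rather than a learned value baseline. This is a minimal illustrative sketch, not the authors' implementation; the function name and the correctness-plus-format reward in the example are assumptions.

```python
# Minimal sketch of GRPO's group-relative advantage (an illustration,
# not the Video-Thinker authors' code). For one prompt, the policy
# samples a group of rollouts; each reward is z-scored against the
# group, so no separate critic/value model is needed.
from statistics import mean, stdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against the group mean/std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: 4 rollouts for one video question, rewarded
# 1.0 when the final answer is correct and the reasoning format is
# followed, 0.0 otherwise (this reward scheme is an assumption).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> roughly [0.866, -0.866, -0.866, 0.866]
```

These per-rollout advantages then weight a PPO-style clipped policy-gradient update over each rollout's tokens, which is what strengthens the reasoning behavior learned during SFT.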
Related papers
- Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks [42.11140720884257]
Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity. We introduce VR-Bench -- a benchmark designed to systematically evaluate video models' reasoning capabilities.
arXiv Detail & Related papers (2025-11-19T03:18:29Z)
- Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models [78.32948112203228]
Video understanding represents the most challenging frontier in computer vision. The recent emergence of Video Large Multimodal Models (Video-LMMs) has demonstrated remarkable capabilities in video understanding tasks. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities.
arXiv Detail & Related papers (2025-10-06T17:10:44Z)
- FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting [62.25888935329454]
This paper introduces the concept of thinking with long videos and proposes a novel framework, FrameThinker. We show that FrameThinker achieves a significant average improvement of +10.4% over baselines while drastically reducing the number of processed frames. Most notably, our 7B model, FrameThinker, establishes a new state-of-the-art on LongVideo-Reason, achieving 76.1% accuracy using an average of only 20.6 frames.
arXiv Detail & Related papers (2025-09-29T05:36:58Z)
- Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning [39.6349428129868]
Multimodal large language models (MLLMs) are crucial for downstream tasks like video question answering and temporal grounding. We propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning.
arXiv Detail & Related papers (2025-08-06T13:03:21Z)
- ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models [50.42183477287337]
Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning. We introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT). We show that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm.
arXiv Detail & Related papers (2025-07-14T03:21:13Z)
- VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? [18.9270920369958]
Long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. Recent efforts have proposed benchmarks aimed at video reasoning, but tasks are often knowledge-driven and do not rely heavily on visual content. We introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning.
arXiv Detail & Related papers (2025-05-29T11:33:43Z)
- StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models [39.61402609070949]
Video Affective Reasoning (VAR) is the task of predicting and explaining how a video would make a human feel. We propose StimuVAR, a spatiotemporal stimuli-aware framework for VAR with Multimodal Large Language Models (MLLMs). We show that StimuVAR is superior to existing MLLMs in understanding viewers' emotional responses to videos and providing coherent and insightful explanations.
arXiv Detail & Related papers (2024-08-31T00:00:50Z)
- How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
- VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding [65.12464615430036]
This paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of Large Language Models (LLMs). Ours is a novel approach to extend the utility of LLMs in the context of video tasks, leveraging their capacity to generalize from minimal input and output demonstrations within a contextual framework.
arXiv Detail & Related papers (2024-03-21T18:00:00Z)
- Video Understanding with Large Language Models: A Survey [107.7736911322462]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding. The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning. This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
- VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality and Semantic Translator, which converts inputs from various modalities into a unified token sequence.
arXiv Detail & Related papers (2023-05-22T17:51:22Z)