LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
- URL: http://arxiv.org/abs/2511.20785v1
- Date: Tue, 25 Nov 2025 19:22:48 GMT
- Title: LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
- Authors: Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing
- Abstract summary: LongVT is an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. We exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning.
- Score: 87.30445183793871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .
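The global-to-local reasoning loop the abstract describes (skim globally, then repeatedly invoke a native video-cropping tool call until the answer is grounded) can be sketched as below. This is a minimal, illustrative mock, not LongVT's implementation: `query_lmm` is a stand-in that zooms into the middle quarter of a wide window instead of calling a real LMM, and all names (`ToolCall`, `sample_frames`, `think_with_video`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    start: float  # requested clip start, in seconds
    end: float    # requested clip end, in seconds

def sample_frames(start, end, n=8):
    # Uniformly sample n timestamps from the current window.
    step = (end - start) / max(n - 1, 1)
    return [start + i * step for i in range(n)]

def query_lmm(question, frames):
    # Mock model: while the sampled window is wide, request a zoom into
    # its middle quarter; once it is narrow enough, "answer". A real
    # system would call the LMM here and parse its tool-call output.
    span = frames[-1] - frames[0]
    if span > 60:
        mid = (frames[0] + frames[-1]) / 2
        return ToolCall(mid - span / 8, mid + span / 8)
    return f"answer grounded in clip [{frames[0]:.0f}s, {frames[-1]:.0f}s]"

def think_with_video(question, duration, max_rounds=4):
    start, end = 0.0, duration              # begin with a global skim
    result = None
    for _ in range(max_rounds):
        frames = sample_frames(start, end)
        result = query_lmm(question, frames)
        if isinstance(result, ToolCall):    # model asks to zoom in
            start, end = result.start, result.end
        else:
            return result                   # answer grounded in evidence
    return result
```

On a one-hour video the mock converges in a few rounds, narrowing from the full 3,600-second window down to a sub-minute clip before answering; the real system would instead stop when the model's answer is supported by the retrieved frames.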
Related papers
- LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding [106.23494088118571]
LongVideo-R1 is a multimodal large language model (MLLM) agent for efficient video context navigation. It infers the most informative video clip for subsequent processing. The LongVideo-R1 agent is fine-tuned from the Qwen-3-8B model through a two-stage paradigm.
arXiv Detail & Related papers (2026-02-24T13:49:47Z) - Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding [38.87967229483403]
Video-TwG is a curriculum-reinforced framework that employs a novel Think-with-Grounding paradigm. Video-TwG can be trained end-to-end in a straightforward manner, without relying on complex auxiliary modules or heavily annotated reasoning traces. The algorithm features a fine-grained grounding reward, a self-confirmed pseudo reward, and an accuracy-gated mechanism.
arXiv Detail & Related papers (2026-02-21T03:16:23Z) - VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding [9.415923244280542]
VideoBrain is an end-to-end framework that enables Vision-Language Models to adaptively acquire visual information through learned sampling policies. The approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals.
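The dual-agent sampling this summary describes can be sketched as two cooperating selectors: a semantic agent that picks query-relevant anchor frames, and a uniform agent that densely samples around them. The CLIP similarity scores below are mocked as a plain list, and all function names are illustrative assumptions rather than VideoBrain's actual API.

```python
def semantic_topk(similarities, k=3):
    # Semantic agent: indices of the k frames most similar to the query
    # (in a real system these would be CLIP query-frame similarities).
    return sorted(range(len(similarities)),
                  key=lambda i: similarities[i], reverse=True)[:k]

def uniform_samples(start, end, n=5):
    # Uniform agent: n evenly spaced timestamps within [start, end].
    step = (end - start) / max(n - 1, 1)
    return [start + i * step for i in range(n)]

# Mocked per-second similarity scores for a 10-second video.
scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1]
anchors = semantic_topk(scores, k=2)   # top-scoring seconds 2 and 3
dense = uniform_samples(anchors[0] - 1, anchors[0] + 1, n=5)
```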
arXiv Detail & Related papers (2026-02-04T00:08:35Z) - A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos [76.98722001848493]
LongShOTBench is a diagnostic benchmark for long-form multimodal video understanding. It includes open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use. LongShOTAgent is an agentic system that analyzes long videos via preprocessing, search, and iterative refinement.
arXiv Detail & Related papers (2025-12-18T18:59:27Z) - Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning [39.6349428129868]
Multimodal large language models (MLLMs) are crucial for downstream tasks like video question answering and temporal grounding. We propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning.
arXiv Detail & Related papers (2025-08-06T13:03:21Z) - TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding [25.675553077419274]
Multimodal Language Models (MLLMs) have demonstrated significant progress in vision tasks, yet they still face challenges when processing long-duration inputs. We propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks and transfers across different cutting-edge Video-MLLMs.
arXiv Detail & Related papers (2025-08-06T12:03:36Z) - ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts [56.75723197779384]
ARC-Hunyuan-Video is a multimodal model that processes visual, audio, and textual signals end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning.
arXiv Detail & Related papers (2025-07-28T15:52:36Z) - Scaling RL to Long Videos [115.96341152407008]
LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively. LongVILA-R1-7B supports processing up to 8,192 video frames per video with configurable FPS settings. We publicly release our training system, which supports RL training on various modalities.
arXiv Detail & Related papers (2025-07-10T17:47:40Z) - VideoExplorer: Think With Videos For Agentic Long-Video Understanding [117.68219930263153]
Long-video understanding is a challenging problem in computer vision. We propose VideoExplorer, a framework grounded in the principle of "thinking with video". Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding.
arXiv Detail & Related papers (2025-06-12T15:39:10Z) - Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) to adapt pretrained vLLMs for generalizing to longer videos.
Our approach outperforms state-of-the-art large models by 3-6% in absolute accuracy across all tasks.
Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.