Think With Videos For Agentic Long-Video Understanding
- URL: http://arxiv.org/abs/2506.10821v3
- Date: Fri, 26 Sep 2025 08:22:16 GMT
- Title: Think With Videos For Agentic Long-Video Understanding
- Authors: Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Yan Shu, Nicu Sebe, Ji-Rong Wen, Zhicheng Dou
- Abstract summary: Long-video understanding is a challenging problem in computer vision. We propose VideoExplorer, a framework grounded in the principle of ``thinking with video''. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding.
- Score: 117.68219930263153
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-video understanding (LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of ``thinking with video'', which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository (https://github.com/yhy-2000/VideoDeepResearch).
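To make the iterative process described in the abstract concrete, the following is a minimal Python sketch of a "plan, ground, perceive, answer" loop. The `mllm` and `video` objects and every method name (`formulate_subquestion`, `ground_segment`, `sample_frames`, `describe`, `try_answer`) are hypothetical placeholders introduced for illustration, not the authors' actual API; the released repository defines the real interfaces.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    segment: tuple   # (start_sec, end_sec) of a grounded moment (assumed format)
    frames: list     # frames resampled from that segment
    note: str        # the model's textual observation about the clip


def video_explorer(video, question, mllm, max_steps=8):
    """Sketch of an iterative 'thinking with video' loop (hypothetical interfaces)."""
    evidence = []
    for _ in range(max_steps):
        # 1) Planning: pose a sub-question given what has been seen so far.
        sub_q = mllm.formulate_subquestion(question, evidence)

        # 2) Temporal grounding: locate the moment relevant to the sub-question.
        start, end = mllm.ground_segment(video, sub_q)

        # 3) Scalable perception: resample the grounded clip at a finer frame rate.
        frames = video.sample_frames(start, end, fps=2.0)
        note = mllm.describe(frames, sub_q)
        evidence.append(Evidence((start, end), frames, note))

        # 4) Answer attempt: stop once the accumulated evidence suffices.
        answer = mllm.try_answer(question, evidence)
        if answer is not None:
            return answer

    # Fall back to the best answer available after the step budget is spent.
    return mllm.try_answer(question, evidence, force=True)
```

Under this reading, the two-stage training pipeline in the abstract would operate on trajectories produced by such a loop: supervised initialization imitates high-quality trajectories, and trajectory-level preference optimization then favors trajectories whose grounding and information integration earn higher downstream rewards.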
Related papers
- LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding [106.23494088118571]
LongVideo-R1 is a multimodal large language model (MLLM) agent for efficient video context navigation. It infers the most informative video clip for subsequent processing. The LongVideo-R1 agent is fine-tuned on the Qwen-3-8B model through a two-stage paradigm.
arXiv Detail & Related papers (2026-02-24T13:49:47Z) - Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding [38.87967229483403]
Video-TwG is a curriculum reinforced framework that employs a novel Think-with-Grounding paradigm. Video-TwG can be trained end-to-end in a straightforward manner, without relying on complex auxiliary modules or heavily annotated reasoning traces. Our algorithm features a fine-grained grounding reward, a self-confirmed pseudo reward, and an accuracy-gated mechanism.
arXiv Detail & Related papers (2026-02-21T03:16:23Z) - LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling [87.98096428508181]
LongVT is an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. We exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning.
arXiv Detail & Related papers (2025-11-25T19:22:48Z) - Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models [78.32948112203228]
Video understanding represents the most challenging frontier in computer vision. The recent emergence of Video Large Multimodal Models (Video-LMMs) has demonstrated remarkable capabilities in video understanding tasks. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities.
arXiv Detail & Related papers (2025-10-06T17:10:44Z) - Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning [39.6349428129868]
Multimodal large language models (MLLMs) are crucial for downstream tasks like video question answering and temporal grounding. We propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning.
arXiv Detail & Related papers (2025-08-06T13:03:21Z) - VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering [14.039561301034848]
Cross-video question answering presents significant challenges beyond traditional single-video understanding. We introduce VideoForest, a novel framework that addresses these challenges through person-anchored hierarchical reasoning. Our approach leverages person-level features as natural bridge points between videos, enabling effective cross-video understanding without requiring end-to-end training.
arXiv Detail & Related papers (2025-08-05T03:33:24Z) - Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference [88.57742986765238]
Free-MoRef is a training-free approach to multiplex the context perception capabilities of Video-MLLMs. Experiments show that Free-MoRef achieves full perception of 2x to 8x longer input frames without compression on a single A100 GPU.
arXiv Detail & Related papers (2025-08-04T07:31:10Z) - Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency [56.475612147721264]
We propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs.
arXiv Detail & Related papers (2025-06-02T17:28:26Z) - Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding [60.88843818016968]
Long-form video understanding presents significant challenges due to temporal-spatial complexity and the difficulty of question answering. We propose the Deep Video Discovery (DVD) agent to leverage an agentic search strategy over segmented video clips. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%.
arXiv Detail & Related papers (2025-05-23T16:37:36Z) - InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling. We develop a new version, InternVideo2.5, with a focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos. Experimental results demonstrate that this LRC design greatly improves the results of video MLLMs on mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z) - AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning [19.68349294206012]
Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. We propose a training-free adaptive inference method for multi-modal LLMs that can accommodate a broad range of efficiency requirements.
arXiv Detail & Related papers (2024-12-04T11:47:57Z) - SEAL: Semantic Attention Learning for Long Video Representation [31.994155533019843]
This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos. To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities. Our representation is versatile and applicable across various long video understanding tasks.
arXiv Detail & Related papers (2024-12-02T18:46:12Z) - MLVU: Benchmarking Multi-task Long Video Understanding [28.35597611731375]
We propose a new benchmark called MLVU (Multi-task Long Video Understanding Benchmark) for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) the substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. The empirical study with 23 of the latest MLLMs reveals significant room for improvement in today's techniques.
arXiv Detail & Related papers (2024-06-06T17:09:32Z) - Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z) - Understanding Long Videos with Multimodal Language Models [44.78900245769057]
Large Language Models (LLMs) have allowed recent approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how the extensive world knowledge and strong reasoning skills of the underlying LLMs influence this strong performance. Our resulting Multimodal Video Understanding framework demonstrates state-of-the-art performance across multiple video understanding benchmarks.
arXiv Detail & Related papers (2024-03-25T17:59:09Z) - LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs [22.696090318037925]
Long video understanding is a significant and ongoing challenge at the intersection of multimedia and artificial intelligence.
We present an Interactive Visual Adapter (IVA) within large language models (LLMs) to enhance interaction with fine-grained visual elements.
arXiv Detail & Related papers (2024-02-21T05:56:52Z) - Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability to perform open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z) - TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models [86.85389322710674]
This work presents an early and holistic evaluation of Large Vision-Language Models (LVLMs).
It proposes a lightweight variant of LVLM-eHub, named Tiny LVLM-eHub.
It provides a systematic assessment of six categories of multimodal capabilities, including visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence.
arXiv Detail & Related papers (2023-08-07T17:17:05Z)