Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
- URL: http://arxiv.org/abs/2505.18079v4
- Date: Mon, 03 Nov 2025 08:39:35 GMT
- Title: Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
- Authors: Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu
- Abstract summary: Long-form video understanding presents significant challenges due to temporal-spatial complexity and difficulty of question answering. We propose the Deep Video Discovery (DVD) agent to leverage an agentic search strategy over segmented video clips. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%.
- Score: 60.88843818016968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent to leverage an agentic search strategy over segmented video clips. Unlike previous video agents that rely on predefined workflows applied uniformly across different queries, our approach emphasizes the autonomous and adaptive nature of agents. By providing a set of search-centric tools on a multi-granular video database, our DVD agent leverages the advanced reasoning capability of the LLM to plan on its current observation state and strategically selects tools to orchestrate an adaptive workflow for different queries in light of the gathered information. We perform a comprehensive evaluation on multiple long video understanding benchmarks that demonstrates our advantage. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%, which substantially surpasses all prior works, and further improves to 76.0% with transcripts. The code has been released at https://github.com/microsoft/DeepVideoDiscovery.
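The adaptive loop described in the abstract (plan on the current observation state, pick a tool, gather evidence, repeat) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Clip` record, the `search` and `browse` tool names, and `stub_plan` are all hypothetical, keyword overlap stands in for real retrieval, and a rule-based planner stands in for the LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Clip:
    """One segmented video clip with a text caption (hypothetical schema)."""
    start: float  # seconds
    end: float    # seconds
    caption: str


def search_clips(db: List[Clip], query: str, top_k: int = 2) -> List[Clip]:
    """Rank clips by naive keyword overlap with the query.

    Assumption: a stand-in for whatever retrieval a real system would use.
    """
    words = set(query.lower().split())
    scored = [(len(words & set(c.caption.lower().split())), c) for c in db]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [clip for _, clip in scored[:top_k]]


Observation = Tuple[str, object]   # (tool name, tool result)
Action = Tuple[str, dict]          # (tool name or "answer", arguments)
Planner = Callable[[str, List[Observation]], Action]


def run_agent(question: str, db: List[Clip], plan: Planner,
              max_steps: int = 5) -> Tuple[Optional[str], List[Observation]]:
    """Let the planner choose a tool at each step until it answers."""
    tools: Dict[str, Callable[..., object]] = {
        "search": lambda query: search_clips(db, query),  # fine-grained lookup
        "browse": lambda: [c.caption for c in db],        # coarse overview
    }
    observations: List[Observation] = []
    for _ in range(max_steps):
        name, args = plan(question, observations)
        if name == "answer":
            return args["text"], observations
        observations.append((name, tools[name](**args)))
    return None, observations  # step budget exhausted without an answer


def stub_plan(question: str, observations: List[Observation]) -> Action:
    """Rule-based stand-in for an LLM planner (hypothetical)."""
    if not observations:
        return "search", {"query": question}
    last_tool, result = observations[-1]
    if last_tool == "search" and result:
        return "answer", {"text": result[0].caption}
    if last_tool == "search":
        return "browse", {}  # nothing found: fall back to a coarse overview
    return "answer", {"text": "unknown"}
```

The point of the sketch is that the tool sequence is not fixed in advance: the same loop issues a single `search` for one question and falls back to `browse` for another, depending on what the earlier observations contained.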
Related papers
- LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding [106.23494088118571]
LongVideo-R1 is a multimodal large language model (MLLM) agent for efficient video context navigation. It infers the most informative video clip for subsequent processing. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm.
arXiv Detail & Related papers (2026-02-24T13:49:47Z)
- Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning [32.71093573332936]
VideoDR is the first video deep research benchmark for studying video agents in open-web settings. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence.
arXiv Detail & Related papers (2026-01-11T15:07:37Z)
- CAViAR: Critic-Augmented Video Agentic Reasoning [90.48729440775223]
We ask: can perception capabilities be leveraged to perform more complex video reasoning? We develop a large language model agent given access to video modules as subagents or tools. We show that the combination of our agent and critic achieves strong performance on datasets.
arXiv Detail & Related papers (2025-09-09T17:59:39Z)
- Infinite Video Understanding [50.78256932424239]
We argue that framing Infinite Video Understanding as a blue-sky research objective provides a vital north star for the multimedia. We outline the core challenges and key research directions towards achieving this transformative capability.
arXiv Detail & Related papers (2025-07-11T23:07:04Z)
- VideoExplorer: Think With Videos For Agentic Long-Video Understanding [117.68219930263153]
Long-video understanding is a challenging problem in computer vision. We propose VideoExplorer, a framework grounded in the principle of "thinking with video". Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding.
arXiv Detail & Related papers (2025-06-12T15:39:10Z)
- VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT [31.413204839972984]
We propose a specialized chain-of-thought (CoT) process tailored for long video analysis. Our uncertainty-aware CoT effectively mitigates noise from external tools, leading to more reliable outputs. We implement our approach in a system called VideoAgent2, which also includes additional modules such as general context acquisition and specialized tool design.
arXiv Detail & Related papers (2025-04-06T13:03:34Z)
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. We introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z)
- Agent-based Video Trimming [17.519404251018308]
We introduce a novel task called Video Trimming (VT). VT focuses on detecting wasted footage, selecting valuable segments, and composing them into a final video with a coherent story. AVT received more favorable evaluations in user studies and demonstrated superior mAP and precision on the YouTube Highlights, TVSum, and our own dataset for the highlight detection task.
arXiv Detail & Related papers (2024-12-12T17:59:28Z)
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Long Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content. We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
- OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer [14.503628667535425]
Processing extensive videos presents significant challenges due to the vast data and processing demands.
We develop OmAgent, which efficiently stores and retrieves relevant video frames for specific queries.
It features a Divide-and-Conquer Loop capable of autonomous reasoning.
We have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks.
arXiv Detail & Related papers (2024-06-24T13:05:39Z)
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs). We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation. We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
- How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
- MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding [69.04413943858584]
We introduce MoVQA, a long-form movie question-answering dataset.
We also benchmark to assess the diverse cognitive capabilities of multimodal systems.
arXiv Detail & Related papers (2023-12-08T03:33:38Z)
- Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding [15.697251303126874]
Deep Video Understanding (DVU) Challenge aims to push the boundaries of multimodal extraction, fusion, and analytics.
This paper introduces a query-aware method for long video localization and relation discrimination, leveraging an image-language pretrained model.
Our approach achieved first and fourth positions for two groups of movie-level queries.
arXiv Detail & Related papers (2023-10-19T13:26:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.