Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
- URL: http://arxiv.org/abs/2512.05774v1
- Date: Fri, 05 Dec 2025 15:03:48 GMT
- Title: Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
- Authors: Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, Juan Carlos Niebles
- Abstract summary: Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires query-relevant evidence directly from pixels.
- Score: 139.83981719664794
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest performance with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
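The plan-observe-reflect loop described in the abstract can be summarized with a minimal sketch. The `Planner`/`Observer`/`Reflector` interfaces, method names, and data structures below are hypothetical stand-ins for MLLM-backed agents, not the authors' released implementation:

```python
# Illustrative sketch of an iterative plan-observe-reflect loop.
# All agent interfaces here (planner.propose, observer.execute,
# reflector.assess, ...) are assumed, not taken from the paper's code.
from dataclasses import dataclass, field


@dataclass
class Evidence:
    timestamp: float   # seconds into the video
    description: str   # textual observation extracted from pixels


@dataclass
class AVPState:
    query: str
    evidence: list = field(default_factory=list)  # accumulated time-stamped Evidence


def active_video_perception(video, query, planner, observer, reflector, max_rounds=8):
    """Iteratively seek query-relevant evidence until the reflector deems it sufficient."""
    state = AVPState(query=query)
    for _ in range(max_rounds):
        # 1. Plan: propose targeted video interactions (which segments/regions to inspect).
        actions = planner.propose(video, state.query, state.evidence)
        # 2. Observe: execute interactions and extract time-stamped evidence from pixels.
        for action in actions:
            state.evidence.extend(observer.execute(video, action))
        # 3. Reflect: decide whether the evidence suffices to answer, or keep observing.
        verdict = reflector.assess(state.query, state.evidence)
        if verdict.sufficient:
            return verdict.answer
    # Fallback: answer with whatever evidence has been gathered so far.
    return reflector.answer_with_best_effort(state.query, state.evidence)
```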
Related papers
- Video-BrowseComp: Benchmarking Agentic Video Research on Open Web [64.53060049124961]
Video-BrowseComp is a benchmark comprising 210 questions tailored for open-web agentic video reasoning. It enforces a mandatory dependency on temporal visual evidence, ensuring answers cannot be derived solely through text search. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.
arXiv Detail & Related papers (2025-12-28T19:08:27Z)
- EEA: Exploration-Exploitation Agent for Long Video Understanding [24.45791994592314]
Long-form video understanding requires efficient navigation of extensive visual data to pinpoint sparse yet critical information. Current approaches to long-form video understanding often suffer from severe computational overhead due to dense preprocessing. We introduce EEA, a novel video agent framework that achieves an exploration-exploitation balance through semantic guidance.
arXiv Detail & Related papers (2025-12-03T06:48:36Z)
- Video-LLMs with Temporal Visual Screening [59.18455762289321]
Temporal Visual Screening (TVS) is a new task that universally pre-processes video question answering and instruction tuning data. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference).
arXiv Detail & Related papers (2025-08-27T14:33:32Z)
- StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding [52.55809460075286]
We propose StreamAgent, which anticipates the temporal intervals and spatial regions expected to contain future task-relevant information. We integrate question semantics and historical observations by prompting the anticipatory agent to anticipate the temporal progression of key events. Our method outperforms existing methods in response accuracy and real-time efficiency, highlighting its practical value for real-world streaming scenarios.
arXiv Detail & Related papers (2025-08-03T18:15:42Z)
- Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning [6.9627404612894335]
Temporal Video Grounding (TVG) requires pinpointing relevant temporal segments in a video based on a language query. We propose Tempo-R0, a Video Multimodal Large Language Model (Video-MLLM) for the temporal video grounding task. Our method achieves a notable advantage of around 3.5% over SOTA solutions on the original QVHighlights test benchmark.
arXiv Detail & Related papers (2025-07-07T06:51:40Z)
- VideoExplorer: Think With Videos For Agentic Long-Video Understanding [117.68219930263153]
Long-video understanding is a challenging problem in computer vision. We propose VideoExplorer, a framework grounded in the principle of "thinking with video". Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding.
arXiv Detail & Related papers (2025-06-12T15:39:10Z)
- VIBE: Annotation-Free Video-to-Text Information Bottleneck Evaluation for TL;DR [23.144642468756032]
Current vision-language models (VLMs) produce verbose, redundant outputs that hinder task performance. Existing video caption evaluation depends on costly human annotations and overlooks the summaries' utility in downstream tasks. VIBE scores VLM outputs using two metrics: grounding (how well the summary aligns with visual content) and utility. VIBE selects from randomly sampled VLM outputs by ranking them according to the two scores to support effective human decision-making.
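A minimal sketch of the two-score selection idea described in this entry: rank candidate summaries by a combined grounding and utility score and keep the best one. The scoring callables and the interpolation weight are hypothetical placeholders rather than the paper's actual metrics:

```python
# Hypothetical sketch of ranking candidate VLM summaries by two scores
# (grounding and utility); the concrete scoring functions and the weight
# alpha are assumed placeholders, not VIBE's actual formulation.
def select_summary(candidates, grounding_score, utility_score, alpha=0.5):
    """Return the candidate summary with the best combined score.

    candidates:      list of candidate summary strings sampled from a VLM
    grounding_score: callable(summary) -> float, alignment with visual content
    utility_score:   callable(summary) -> float, usefulness for the downstream task
    alpha:           interpolation weight between the two scores (assumed)
    """
    def combined(summary):
        return alpha * grounding_score(summary) + (1 - alpha) * utility_score(summary)

    return max(candidates, key=combined)
```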
arXiv Detail & Related papers (2025-05-23T03:11:29Z)
- Query-Dependent Video Representation for Moment Retrieval and Highlight Detection [8.74967598360817]
The key objective of MR/HD is to localize the moment and estimate the clip-wise accordance level, i.e., the saliency score, for a given text query.
Recent transformer-based models do not fully exploit the information of a given query.
We introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD.
arXiv Detail & Related papers (2023-03-24T09:32:50Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)