Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models
Abstract Overview
This paper introduces Video Active Perception (VAP), a training-free inference-time method for long-form video question answering with vision-language models. VAP uses a lightweight text-conditioned video generation model (CogVideoX) to produce expected latent video dynamics from a small set of uniformly sampled frames combined with the question and answer candidates. It then compares these generated latents against latents encoded from all real frames, selecting the most dissimilar (most "surprising") frames for downstream VLM inference. The method is evaluated on EgoSchema, NExT-QA, ActivityNet-QA, IntentQA, CLEVRER, VideoMME, and MLVU benchmarks, demonstrating improvements in both accuracy and frame efficiency over uniform sampling and prior frame-selection baselines.
Novelty
The key contribution is formulating inference-time video frame selection as active perception, where a pretrained video generation model serves as a prior and keyframes are chosen based on how much real frames deviate from generated expectations in latent space. Unlike prior frame-selection methods, VAP is training-free, does not require captioning models, and performs selection in a single round rather than through iterative agent loops or complex memory structures.
Results
VAP achieves state-of-the-art zero-shot results on five video QA benchmarks: 68.1% on EgoSchema, 81.4% on NExT-QA, 64.6% on ActivityNet-QA, 72.2% on IntentQA, and 40.5% on CLEVRER. It improves frame efficiency by up to 5.6× over standard GPT-4o (32 vs. 180 frames on EgoSchema) at comparable or better accuracy, and it shows lower latency than GPT-4o mini under matched-accuracy settings. Additional experiments show consistent improvements on very long-video benchmarks (VideoMME and MLVU) and stronger performance on temporal, causal, explanatory, and counterfactual reasoning tasks.
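The 5.6× figure follows directly from the two frame counts quoted for EgoSchema. The short check below assumes frame efficiency means the ratio of baseline frames to selected frames; the function name is illustrative and not taken from the paper.

```python
def frame_efficiency(baseline_frames: int, selected_frames: int) -> float:
    """Ratio of frames used by the baseline to frames used by the method.

    Assumption: "frame efficiency" is read here as this simple ratio.
    """
    if selected_frames <= 0:
        raise ValueError("selected_frames must be positive")
    return baseline_frames / selected_frames


# Frame counts quoted in the text for EgoSchema: 180 (GPT-4o) vs. 32 (VAP).
ratio = frame_efficiency(180, 32)
print(f"{ratio:.3f}x")  # 180 / 32 = 5.625, reported as 5.6x
```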
Key Points
- VAP selects keyframes in three steps: it encodes all real frames and the generated frames (produced by CogVideoX conditioned on an initial set of uniformly sampled frames, the question, and the answer candidates) into latent space, computes cosine similarity between paired latents, and passes the most dissimilar real frames to the VLM for inference.
- The method is training-free, VLM-agnostic, and operates in a single selection round without requiring captioning models, iterative retrieval loops, or complex external memory structures used in prior approaches such as VideoAgent and VideoTree.
- Empirically, VAP improves both benchmark accuracy and frame efficiency across multiple datasets, with particularly strong gains on reasoning-heavy tasks (e.g., 154% relative improvement over VideoTree on CLEVRER explanatory questions) and consistent improvements on very long-video benchmarks.
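The selection step described in the first key point can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each frame's latent has already been flattened to a plain vector, that real and generated latents are paired one-to-one by index, and that the selected frames are returned in temporal order. The function names and the toy vectors are hypothetical stand-ins for latents produced by the video generation model's encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as uninformative
    return dot / (norm_a * norm_b)


def select_surprising_frames(real_latents, generated_latents, k):
    """Return indices of the k real frames least similar to their
    generated counterparts, sorted in temporal order.

    Assumption: latents are paired by index (frame i vs. generated frame i).
    """
    if len(real_latents) != len(generated_latents):
        raise ValueError("real and generated latents must be paired one-to-one")
    # Surprise score: 1 - cosine similarity, so higher means more dissimilar.
    scores = [
        1.0 - cosine_similarity(r, g)
        for r, g in zip(real_latents, generated_latents)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy 2-D latents standing in for real encoder outputs.
real = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
generated = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
print(select_surprising_frames(real, generated, k=2))  # [1, 3]
```

In the toy example, frame 0 matches its generated counterpart exactly and frame 2 is only mildly off, so the two frames that deviate most (1 and 3) are the ones kept for the VLM.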