VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding
- URL: http://arxiv.org/abs/2602.04094v1
- Date: Wed, 04 Feb 2026 00:08:35 GMT
- Title: VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding
- Authors: Junbo Zou, Ziheng Huang, Shengjie Zhang, Liwen Zhang, Weining Shen,
- Abstract summary: VideoBrain is an end-to-end framework that enables Vision-Language Models to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals.
- Score: 9.415923244280542
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-form video understanding remains challenging for Vision-Language Models (VLMs) due to the inherent tension between computational constraints and the need to capture information distributed across thousands of frames. Existing approaches either sample frames uniformly (risking information loss) or select keyframes in a single pass (with no recovery from poor choices). We propose VideoBrain, an end-to-end framework that enables VLMs to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals. Unlike prior agent-based methods that rely on text-only LLMs orchestrating visual tools, our VLM directly perceives frames and reasons about information sufficiency. To prevent models from invoking agents indiscriminately to maximize rewards, we introduce a behavior-aware reward function coupled with a data classification pipeline that teaches the model when agent invocation is genuinely beneficial. Experiments on four long video benchmarks demonstrate that VideoBrain achieves +3.5% to +9.0% improvement over the baseline while using 30-40% fewer frames, with strong cross-dataset generalization to short video benchmarks.
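The dual-agent design described in the abstract can be pictured as two retrieval primitives the VLM may invoke: one ranks frames by CLIP similarity to the query, the other samples densely inside a chosen time interval. Below is a minimal sketch assuming precomputed CLIP frame embeddings and a query embedding; the function names and the top-k/interval parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def clip_agent(frame_embs, query_emb, k=8):
    """Semantic retrieval: indices of the k frames whose CLIP embeddings
    are most similar (cosine) to the text query embedding."""
    frame_embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    query_emb = query_emb / np.linalg.norm(query_emb)
    scores = frame_embs @ query_emb
    return np.argsort(-scores)[:k]

def uniform_agent(start, end, k=8):
    """Dense temporal sampling: k evenly spaced frame indices in [start, end]."""
    return np.linspace(start, end, num=k, dtype=int)

# Toy usage with random embeddings (1000 frames, 512-d).
rng = np.random.default_rng(0)
frame_embs = rng.normal(size=(1000, 512))
query_emb = rng.normal(size=512)
print(clip_agent(frame_embs, query_emb, k=4))  # indices of the 4 closest frames
print(uniform_agent(200, 260, k=4))            # [200 220 240 260]
```

In this reading, the learned policy decides whether either agent is worth invoking at all; the behavior-aware reward is what discourages calling them indiscriminately.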
Related papers
- FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning [65.42201665046505]
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail. We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract...
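The frame-interleaved loop can be sketched as a dialogue in which the model either emits a final answer or a frame request that is served before the next turn. The `vlm` and `video` objects and their methods below are hypothetical placeholders under that reading, not FrameMind's actual API.

```python
def answer_with_ficot(vlm, video, question, max_turns=5):
    """Alternate between textual reasoning and active frame acquisition.
    `vlm.step` is assumed to return either ("answer", text) or
    ("request_frames", (start_sec, end_sec, num_frames))."""
    context = [("question", question)]
    for _ in range(max_turns):
        action, payload = vlm.step(context)
        if action == "answer":
            return payload
        start, end, n = payload               # model asks for more visual evidence
        frames = video.sample(start, end, n)  # hypothetical frame extractor
        context.append(("frames", frames))
    return vlm.finalize(context)              # hypothetical: force a final answer
```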
arXiv Detail & Related papers (2025-09-28T17:59:43Z)
- Video Panels for Long Video Understanding [25.560912635941662]
We propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing Video-Language Models.
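Trading spatial detail for temporal coverage amounts to tiling several frames into a single input image. A minimal sketch using Pillow; the 3x3 layout and tile size are illustrative choices, not the paper's exact configuration.

```python
from PIL import Image

def make_panel(frames, rows=3, cols=3, tile_size=(224, 224)):
    """Tile up to rows*cols frames (PIL images) into one panel image, so a
    single VLM image slot carries several timesteps at reduced resolution."""
    panel = Image.new("RGB", (cols * tile_size[0], rows * tile_size[1]))
    for i, frame in enumerate(frames[: rows * cols]):
        r, c = divmod(i, cols)
        panel.paste(frame.resize(tile_size), (c * tile_size[0], r * tile_size[1]))
    return panel
```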
arXiv Detail & Related papers (2025-09-28T08:05:55Z)
- Moment Sampling in Video LLMs for Long-Form Video QA [22.638644170177013]
"moment sampling" is a model-agnostic approach that enables the model to select the most relevant frames according to the context of the question.<n>By focusing on the frames most pertinent to the given question, our method enhances long-form VideoQA performance in Video LLMs.
arXiv Detail & Related papers (2025-06-18T03:23:56Z)
- ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
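Under this description, the trial-and-error signal is simply whether the downstream answerer gets the question right given the selected frames. A schematic reward in that spirit; `frame_selector`, `answer_model`, and the frame-cost penalty are assumptions for illustration, not ViaRL's exact formulation.

```python
def selection_reward(frame_selector, answer_model, video, question, gold_answer):
    """Rule-based reward: 1.0 if the frozen downstream model answers correctly
    from the selected frames, else 0.0, minus an illustrative frame-cost term."""
    frames = frame_selector.select(video, question)      # policy being trained
    prediction = answer_model.answer(frames, question)   # frozen downstream VLM
    accuracy = float(prediction.strip() == gold_answer.strip())
    frame_cost = 0.01 * len(frames)                      # hypothetical penalty
    return accuracy - frame_cost
```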
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
- BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding [51.49345400300556]
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. Traditional approaches, such as uniform frame sampling, often allocate resources to irrelevant content. We introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies.
arXiv Detail & Related papers (2025-03-27T13:18:40Z)
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion. We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
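The data-level sampling described above is, in effect, a similarity ranking like the CLIP agent sketched earlier, except that it replaces uniform sampling outright and keeps the selected frames in temporal order. A brief sketch under that reading, with precomputed embeddings assumed:

```python
import numpy as np

def clip_score_sampling(frame_embs, text_emb, k=16):
    """Rank all frames by CLIP similarity to the query, keep the top-k,
    and return them in their original temporal order."""
    sims = (frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)) @ (
        text_emb / np.linalg.norm(text_emb)
    )
    top = np.argsort(-sims)[:k]
    return np.sort(top)   # preserve temporal ordering for the downstream model
```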
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models [41.12711820047315]
Video understanding models usually randomly sample a set of frames or clips, regardless of internal correlations between their visual contents or their relevance to the problem.
We propose two frame sampling strategies, namely the most domain frames (MDF) and most implied frames (MIF), to maximally preserve those frames that are most likely vital to the given questions.
arXiv Detail & Related papers (2023-07-09T14:54:30Z)