HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning
- URL: http://arxiv.org/abs/2512.11534v1
- Date: Fri, 12 Dec 2025 13:10:30 GMT
- Title: HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning
- Authors: Yiqing Yang, Kin-Man Lam
- Abstract summary: Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. We propose an end-to-end trainable, task-adaptive framework for frame selection.
- Score: 13.569944737211472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates relevance, coverage, and redundancy, enabling differentiable optimization via Gumbel-Softmax to select optimal frame combinations at the set level. Finally, student-teacher mutual learning is employed, where the student selector (SLM) and teacher reasoner (MLLM) are trained to align their frame importance distributions via KL divergence. Combined with cross-entropy loss, this enables end-to-end optimization, eliminating reliance on static pseudo labels. Experiments across various benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, demonstrate that our method significantly outperforms existing approaches.
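The abstract describes three concrete mechanisms: query-aware frame scoring, a differentiable set-level objective (relevance, coverage, redundancy) relaxed with Gumbel-Softmax, and a KL-based student-teacher alignment loss combined with cross-entropy. Below is a minimal PyTorch sketch of how these pieces could fit together. The function names, the exact forms of the coverage and redundancy terms, and the loss weight `alpha` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def score_frames(query_vec, frame_feats):
    """Query-aware relevance: cosine similarity between the SLM's implicit
    query vector (d,) and per-frame multimodal features (T, d).
    (Illustrative; the paper's actual scoring head is not specified here.)"""
    q = F.normalize(query_vec, dim=-1)
    f = F.normalize(frame_feats, dim=-1)
    return f @ q                                        # (T,) frame scores

def gumbel_softmax_select(scores, k, tau=0.5):
    """Relaxed top-k: draw k independent Gumbel perturbations of the frame
    scores and softmax each row, giving a differentiable (k, T) selection."""
    g = -torch.log(-torch.log(torch.rand(k, scores.shape[-1]) + 1e-9) + 1e-9)
    return F.softmax((scores.unsqueeze(0) + g) / tau, dim=-1)

def set_level_objective(sel, frame_feats, scores, lam_cov=1.0, lam_red=1.0):
    """Continuous set-level objective: relevance + coverage - redundancy.
    sel: (k, T) soft selection; frame_feats: (T, d), L2-normalized."""
    sel_feats = sel @ frame_feats                       # (k, d) soft-selected features
    relevance = (sel @ scores).mean()                   # selected frames should score high
    # Coverage: every frame should be close to some selected frame.
    coverage = (frame_feats @ sel_feats.t()).max(dim=1).values.mean()
    # Redundancy: mean off-diagonal similarity within the selected set.
    k = sel.shape[0]
    pairwise = sel_feats @ sel_feats.t()                # (k, k)
    redundancy = (pairwise - torch.diag(torch.diag(pairwise))).sum() / (k * (k - 1))
    return relevance + lam_cov * coverage - lam_red * redundancy

def mutual_learning_loss(student_scores, teacher_scores, answer_logits, answer, alpha=0.5):
    """Student-teacher mutual learning: KL-align the SLM selector's frame
    importance distribution (B, T) with the MLLM reasoner's, plus
    cross-entropy on the final answer."""
    kl = F.kl_div(F.log_softmax(student_scores, dim=-1),
                  F.softmax(teacher_scores, dim=-1), reduction="batchmean")
    ce = F.cross_entropy(answer_logits, answer)
    return ce + alpha * kl
```

In this sketch, gradients from both the set-level objective and the answer loss flow back through the soft selection into the SLM scorer; at inference one would simply take a hard top-k over the learned scores.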
Related papers
- FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering [0.6107667071306521]
We develop FocusGraph, a framework for question answering over long egocentric videos. We use a lightweight trainable Scene-Caption LLM Selector that selects query-relevant clips based on their graph-based captions. We then design a training-free Patch-wise Sparse-Flow Retention (PSFR) method to retain informative patches from the resulting sequence of clips.
arXiv Detail & Related papers (2026-03-04T18:14:00Z)
- Event-Anchored Frame Selection for Effective Long-Video Understanding [67.56884568828508]
Event-Anchored Frame Selection (EFS) is a hierarchical, event-aware pipeline. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs.
arXiv Detail & Related papers (2026-03-01T08:25:37Z)
- Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation [50.22481337087162]
Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Refer-Agent is a collaborative multi-agent system with alternating reasoning-reflection mechanisms.
arXiv Detail & Related papers (2026-02-03T14:48:12Z)
- A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering [15.220013605396396]
A.I.R. is a training-free approach for Adaptive, Iterative, and Reasoning-based frame selection. We leverage a powerful Vision-Language Model (VLM) to perform deep, semantic analysis on complex queries. Our approach significantly boosts the performance of the foundation VLM and achieves substantial gains in computational efficiency.
arXiv Detail & Related papers (2025-10-06T01:51:13Z)
- FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning [65.42201665046505]
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail. We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract …
arXiv Detail & Related papers (2025-09-28T17:59:43Z)
- Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs [13.306662159600677]
We introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution adaptation. Q-Frame employs a training-free, plug-and-play strategy driven by a text-image matching network such as CLIP (a minimal sketch of this style of scoring appears after this list). We demonstrate Q-Frame's effectiveness through extensive experiments on benchmark datasets.
arXiv Detail & Related papers (2025-06-27T11:30:51Z)
- ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding [52.050036778325094]
We introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), a novel frame-level policy optimization framework. ReFoCUS learns a frame selection policy via reinforcement learning, using reward signals derived from a reference LMM to reflect the model's intrinsic preferences for frames. Our approach consistently improves reasoning performance across multiple video QA benchmarks.
arXiv Detail & Related papers (2025-06-02T03:08:07Z)
- M-LLM Based Video Frame Selection for Efficient Video Understanding [60.93714759178143]
We propose a lightweight M-LLM-based frame selection method that adaptively selects frames that are more relevant to users' queries. The selected frames are then digested by a frozen downstream video M-LLM for visual reasoning and question answering.
arXiv Detail & Related papers (2025-02-27T01:44:13Z)
- Exploring the Design Space of Visual Context Representation in Video MLLMs [102.11582556690388]
Video Multimodal Large Language Models (MLLMs) have shown a remarkable capability to understand video semantics across various downstream tasks.
Visual context representation refers to the scheme to select frames from a video and further select the tokens from a frame.
In this paper, we explore the design space for visual context representation, and aim to improve the performance of video MLLMs by finding more effective representation schemes.
arXiv Detail & Related papers (2024-10-17T15:59:52Z)
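Several of the training-free selectors above (e.g., Q-Frame) score frames with a text-image matching network such as CLIP. As a point of contrast with the end-to-end approach of the main paper, here is a minimal sketch of that style of query-aware top-K selection using the Hugging Face transformers CLIP API; the checkpoint name and `k` are assumptions for illustration, not taken from any of the papers above.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_topk_frames(frames, query, k=8, ckpt="openai/clip-vit-base-patch32"):
    """Training-free query-aware selection: score each frame (a PIL image)
    against the text query with CLIP and keep the top-k, in temporal order.
    Note: this scores frames independently -- exactly the scheme the main
    paper argues can pick temporally clustered, redundant frames."""
    model = CLIPModel.from_pretrained(ckpt)
    processor = CLIPProcessor.from_pretrained(ckpt)
    inputs = processor(text=[query], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image.squeeze(-1)  # (num_frames,)
    idx = logits.topk(min(k, len(frames))).indices.sort().values
    return [frames[i] for i in idx.tolist()]
```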