Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding
- URL: http://arxiv.org/abs/2603.00512v1
- Date: Sat, 28 Feb 2026 07:18:07 GMT
- Title: Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding
- Authors: Wang Chen, Yuhui Zeng, Yongdong Luo, Tianyu Xie, Luojun Lin, Jiayi Ji, Yan Zhang, Xiawu Zheng
- Abstract summary: Current methods typically select frames with high relevance to a given query. We introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework. WFS-SB significantly boosts LVLM performance, improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench.
- Score: 43.587729230845525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Frame selection is crucial due to high frame redundancy and limited context windows when applying Large Vision-Language Models (LVLMs) to long videos. Current methods typically select frames with high relevance to a given query, resulting in a disjointed set of frames that disregards the narrative structure of the video. In this paper, we introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts - pivotal moments of narrative change that are essential to comprehending the holistic storyline of the video. However, direct detection of abrupt changes in the query-frame similarity signal is often unreliable due to high-frequency noise arising from model uncertainty and transient visual variations. To address this, we leverage the wavelet transform, which provides an ideal solution through its multi-resolution analysis in both time and frequency domains. By applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic-change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this, WFS-SB comprises a two-stage strategy: first, adaptively allocating a frame budget to each clip based on a composite importance score; and second, within each clip, employing the Maximal Marginal Relevance approach to select a diverse yet relevant set of frames. Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench, consistently outperforming state-of-the-art methods.
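The pipeline the abstract describes (wavelet denoising of the query-frame similarity signal, extrema-based boundary detection, per-clip budget allocation, and MMR selection) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a precomputed cosine-similarity signal between the query and per-frame features, L2-normalized embeddings, and illustrative choices of wavelet family (db4), decomposition level, and MMR trade-off lambda, none of which are specified in the abstract.

```python
import numpy as np
import pywt  # PyWavelets
from scipy.signal import argrelextrema


def detect_semantic_boundaries(similarity, wavelet="db4", level=4):
    """Find semantic boundaries in a noisy query-frame similarity signal.

    Decomposes the signal with a discrete wavelet transform, keeps only the
    coarsest-scale approximation (zeroing the high-frequency detail bands
    that carry model-uncertainty noise), and returns local extrema of the
    reconstructed signal as boundary indices.
    """
    coeffs = pywt.wavedec(similarity, wavelet, level=level)
    coeffs = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
    smooth = pywt.waverec(coeffs, wavelet)[: len(similarity)]
    maxima = argrelextrema(smooth, np.greater)[0]
    minima = argrelextrema(smooth, np.less)[0]
    return np.sort(np.concatenate([maxima, minima]))


def allocate_budget(clip_scores, total_budget):
    """Split the frame budget across clips in proportion to an importance
    score (the composite score's exact definition is not given in the
    abstract, so proportional allocation is an assumption)."""
    p = np.asarray(clip_scores, dtype=float)
    p = p / p.sum()
    alloc = np.floor(p * total_budget).astype(int)
    # Hand any leftover frames to the highest-scoring clips.
    for i in np.argsort(-p)[: total_budget - alloc.sum()]:
        alloc[i] += 1
    return alloc


def mmr_select(frame_feats, query_feat, budget, lam=0.7):
    """Pick `budget` frames from one clip via Maximal Marginal Relevance:
    balance query relevance against redundancy with frames already chosen.
    Assumes rows of `frame_feats` and `query_feat` are L2-normalized."""
    relevance = frame_feats @ query_feat
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(budget, len(frame_feats)):
        redundancy = (frame_feats @ frame_feats[selected].T).max(axis=1)
        score = lam * relevance - (1 - lam) * redundancy
        score[selected] = -np.inf  # never re-pick a chosen frame
        selected.append(int(np.argmax(score)))
    return sorted(selected)
```

Boundaries from `detect_semantic_boundaries` would segment the video into clips, `allocate_budget` would assign per-clip frame counts, and `mmr_select` would run independently inside each clip.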
Related papers
- Event-Anchored Frame Selection for Effective Long-Video Understanding [67.56884568828508]
Event-Anchored Frame Selection (EFS) is a hierarchical, event-aware pipeline. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs.
arXiv Detail & Related papers (2026-03-01T08:25:37Z)
- K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding [38.06179287702453]
K-frames is a novel paradigm for scene-driven selection that preserves temporal continuity. Instead of selecting individual frames, K-frames predicts semantically coherent, query-relevant clips. K-frames provides an effective, interpretable, and plug-and-play solution for selection at various scales.
arXiv Detail & Related papers (2025-10-14T06:23:22Z)
- From Captions to Keyframes: KeyScore for Multimodal Frame Scoring and Video-Language Understanding [1.3856027745141806]
KeyScore is a caption-aware frame scoring method that combines semantic similarity to captions, temporal representativeness, and contextual drop impact. Our method enhances both efficiency and performance, enabling scalable and caption-grounded video understanding.
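The abstract names three per-frame signals but not how they are combined; a weighted sum is one plausible reading. A minimal sketch, with entirely hypothetical weights:

```python
import numpy as np


def keyscore(caption_sim, temporal_rep, drop_impact, weights=(0.4, 0.3, 0.3)):
    """Hypothetical per-frame KeyScore as a weighted sum of the three
    signals the abstract names. The actual combination rule and weights
    are not specified; these values are illustrative only."""
    w1, w2, w3 = weights
    return (w1 * np.asarray(caption_sim)
            + w2 * np.asarray(temporal_rep)
            + w3 * np.asarray(drop_impact))


# Frames with the highest combined score would be kept, e.g.:
# top_k = np.argsort(-keyscore(sims, reps, drops))[:k]
```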
arXiv Detail & Related papers (2025-10-07T23:02:27Z)
- From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding [43.82717677801915]
Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision-language tasks. Their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. We show that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding.
arXiv Detail & Related papers (2025-10-02T17:43:01Z)
- LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning [73.90466023069125]
We propose LOVE-R1, a model that can adaptively zoom in on a video clip. The model is first provided with densely sampled frames at a low resolution. If some spatial details are needed, the model can zoom in on a clip of interest at a higher frame resolution.
arXiv Detail & Related papers (2025-09-29T13:43:55Z)
- FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning [65.42201665046505]
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail. We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract …
arXiv Detail & Related papers (2025-09-28T17:59:43Z)
- Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration [24.337139909108117]
"Less is more" phenomenon where excessive frames can paradoxically degrade performance due to context dilution.<n>"Visual echoes" yield significant temporal redundancy, which we term 'visual echoes'<n>"AFP" employs an adaptive hierarchical clustering algorithm on a fused ResNet-50 and CLIP feature space to identify and merge these echoes into single representatives.<n>Our full approach demonstrates a drastic reduction in required frames by up to 86.9% and total input tokens by up to 83.2%.
arXiv Detail & Related papers (2025-08-05T11:31:55Z)
- Enhancing Long Video Generation Consistency without Tuning [92.1714656167712]
We address consistency and coherence issues in videos generated with either single or multiple prompts. We propose the Time-frequency based temporal Attention Reweighting Algorithm (TiARA), which judiciously edits the attention score matrix. For videos generated from multiple prompts, we further uncover key factors, such as the alignment of the prompts, that affect quality. Inspired by our analyses, we propose PromptBlend, an advanced prompt pipeline that systematically aligns the prompts.
arXiv Detail & Related papers (2024-12-23T03:56:27Z)
- No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection [52.03562682785128]
Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video.
A significant challenge in TVG is a low "Semantic Noise Ratio (SNR)": the lower the SNR, the worse the performance.
We propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection.
arXiv Detail & Related papers (2023-07-20T04:12:10Z)