Event-Anchored Frame Selection for Effective Long-Video Understanding
- URL: http://arxiv.org/abs/2603.00983v1
- Date: Sun, 01 Mar 2026 08:25:37 GMT
- Title: Event-Anchored Frame Selection for Effective Long-Video Understanding
- Authors: Wang Chen, Yongdong Luo, Yuhui Zeng, Luojun Lin, Tianyu Xie, Fei Chao, Rongrong Ji, Xiawu Zheng
- Abstract summary: Event-Anchored Frame Selection (EFS) is a hierarchical, event-aware pipeline. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs.
- Score: 67.56884568828508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Massive frame redundancy and limited context windows make efficient frame selection crucial for long-video understanding with large vision-language models (LVLMs). Prevailing approaches, however, adopt a flat sampling paradigm that treats the video as an unstructured collection of frames. In this paper, we introduce Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline. Leveraging self-supervised DINO embeddings, EFS first partitions the video stream into visually homogeneous temporal segments, which serve as proxies for semantic events. Within each event, it then selects the most query-relevant frame as an anchor. These anchors act as structural priors that guide a global refinement stage using an adaptive Maximal Marginal Relevance (MMR) scheme. This pipeline ensures the final keyframe set jointly optimizes for event coverage, query relevance, and visual diversity. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs, yielding substantial gains on challenging video understanding benchmarks. Specifically, when applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.
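The abstract outlines three stages: DINO-based event segmentation, per-event anchor selection, and MMR refinement. The sketch below is a minimal, illustrative reading of that pipeline, not the authors' implementation: it assumes precomputed DINO frame embeddings and query-frame relevance scores, and all function names, thresholds, and the fixed MMR weight `lam` (the paper's scheme is adaptive) are placeholders.

```python
import numpy as np

def segment_events(frame_embs: np.ndarray, threshold: float = 0.85):
    """Greedily split the video into visually homogeneous segments (event
    proxies): open a new segment when cosine similarity between the current
    frame and the running segment centroid drops below `threshold`."""
    embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    segments, start, centroid = [], 0, embs[0].copy()
    for i in range(1, len(embs)):
        if embs[i] @ centroid / np.linalg.norm(centroid) < threshold:
            segments.append((start, i))
            start, centroid = i, embs[i].copy()
        else:
            centroid = 0.9 * centroid + 0.1 * embs[i]  # running segment mean
    segments.append((start, len(embs)))
    return segments

def select_anchors(relevance: np.ndarray, segments):
    """Pick the most query-relevant frame inside each event as its anchor."""
    return [s + int(np.argmax(relevance[s:e])) for s, e in segments]

def mmr_refine(frame_embs, relevance, anchors, budget: int = 32, lam: float = 0.7):
    """Greedy Maximal Marginal Relevance: seed the set with the event anchors,
    then add frames trading off query relevance (weight `lam`) against
    redundancy with already-selected frames. Assumes `relevance` is roughly
    in [0, 1]; unlike the paper's adaptive scheme, this `lam` is fixed."""
    embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    selected = list(dict.fromkeys(anchors))[:budget]       # dedup, keep order
    candidates = [i for i in range(len(embs)) if i not in selected]
    while len(selected) < budget and candidates:
        sel = embs[selected]
        scores = [lam * relevance[c] - (1 - lam) * float(np.max(sel @ embs[c]))
                  for c in candidates]
        selected.append(candidates.pop(int(np.argmax(scores))))
    return sorted(selected)
```

Running the three functions in sequence (segment_events, then select_anchors, then mmr_refine) yields a keyframe set that covers every detected event before spending the remaining budget on relevant, non-redundant frames.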
Related papers
- Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding [43.587729230845525]
Current methods typically select frames with high relevance to a given query. We introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework. WFS-SB significantly boosts LVLM performance, improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench.
arXiv Detail & Related papers (2026-02-28T07:18:07Z)
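The summary names the technique but not its mechanics. One plausible realization, sketched here with PyWavelets, is to treat consecutive-frame embedding distance as a 1-D signal and flag spikes in its wavelet detail coefficients as semantic boundaries; the wavelet choice, decomposition level, and threshold are assumptions, not details from the paper.

```python
import numpy as np
import pywt  # PyWavelets

def detect_semantic_boundaries(frame_embs: np.ndarray, wavelet: str = "haar",
                               level: int = 2, k: float = 2.0) -> list:
    """Turn the video into a 1-D novelty signal (cosine distance between
    consecutive frame embeddings), then flag spikes in the finest-scale
    wavelet detail coefficients, which respond to abrupt transitions."""
    embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    novelty = 1.0 - np.sum(embs[:-1] * embs[1:], axis=1)   # length T - 1
    coeffs = pywt.wavedec(novelty, wavelet, level=level)
    mag = np.abs(coeffs[-1])                               # finest-scale details
    thresh = mag.mean() + k * mag.std()                    # spike threshold
    scale = len(novelty) / len(mag)                        # map back to frames
    return [int(i * scale) for i, m in enumerate(mag) if m > thresh]
```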
arXiv Detail & Related papers (2026-02-28T07:18:07Z) - VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs [28.026438743789907]
VideoScaffold is a dynamic representation framework designed for streaming video understanding. It adaptively adjusts event granularity according to video duration while preserving fine-grained visual semantics. The framework is modular and plug-and-play, seamlessly extending existing image-based MLLMs to continuous video comprehension.
arXiv Detail & Related papers (2025-12-23T03:33:45Z)
arXiv Detail & Related papers (2025-12-23T03:33:45Z) - HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning [13.569944737211472]
Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. We propose an end-to-end trainable, task-adaptive framework for frame selection.
arXiv Detail & Related papers (2025-12-12T13:10:30Z)
arXiv Detail & Related papers (2025-12-12T13:10:30Z) - From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding [43.82717677801915]
Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision-language tasks. Their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. We show that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding.
arXiv Detail & Related papers (2025-10-02T17:43:01Z)
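A minimal way to go from frames to clips in the spirit of this summary is to expand each top-scoring frame into a fixed temporal window and merge overlaps; the window radius, clip budget, and use of precomputed relevance scores below are assumptions, not the paper's method.

```python
import numpy as np

def select_key_clips(relevance: np.ndarray, num_clips: int = 8,
                     radius: int = 4) -> list:
    """Pick the highest-scoring frames, expand each into a short,
    temporally coherent clip [t - radius, t + radius], and merge
    overlapping clips so selected frames stay contiguous."""
    peaks = np.argsort(relevance)[::-1][:num_clips]
    clips = sorted((max(0, int(t) - radius),
                    min(len(relevance), int(t) + radius + 1)) for t in peaks)
    merged = [clips[0]]
    for s, e in clips[1:]:
        ps, pe = merged[-1]
        if s <= pe:                       # overlapping or adjacent: merge
            merged[-1] = (ps, max(pe, e))
        else:
            merged.append((s, e))
    return merged                         # list of (start, end) frame ranges
```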
arXiv Detail & Related papers (2025-10-02T17:43:01Z) - Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs [13.306662159600677]
We introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution adaptation. Q-Frame employs a training-free, plug-and-play strategy whose frame scores are generated by a text-image matching network such as CLIP. We demonstrate Q-Frame's effectiveness through extensive experiments on benchmark datasets.
arXiv Detail & Related papers (2025-06-27T11:30:51Z)
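The summary says only that frame scores come from a text-image matching network like CLIP. Below is a minimal sketch of that scoring step using Hugging Face's CLIP implementation; the checkpoint name and the use of raw similarity logits as relevance scores are our choices, not Q-Frame's.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def query_frame_scores(query: str, frames) -> torch.Tensor:
    """Score each frame (a list of PIL images) against the text query;
    a higher image-text similarity logit means a more relevant frame."""
    inputs = processor(text=[query], images=frames,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    return out.logits_per_image.squeeze(-1)   # shape: (num_frames,)
```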
arXiv Detail & Related papers (2025-06-27T11:30:51Z) - ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial and error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
arXiv Detail & Related papers (2025-05-21T12:29:40Z) - FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding [70.56829394569938]
We propose Frame Selection Augmented Generation (FRAG) to process long inputs without long-context LMMs. The core of the selection process is scoring each frame independently, which does not require long-context processing. We show that FRAG consistently improves performance and achieves state-of-the-art results for both long video and long document understanding.
arXiv Detail & Related papers (2025-04-24T11:19:18Z)
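Because FRAG scores each frame independently, the selection stage reduces to a per-frame loop plus a top-K cut, with no long-context pass; `score_fn` in this schematic is a hypothetical stand-in for whatever per-frame scorer is used.

```python
import numpy as np

def frag_style_select(frames, query: str, score_fn, k: int = 16) -> list:
    """Score every frame on its own (each call sees one frame plus the
    query, never the whole video), then keep the top-k frames in
    temporal order for the final generation pass."""
    scores = np.array([score_fn(frame, query) for frame in frames])
    topk = np.argsort(scores)[-k:]
    return sorted(int(i) for i in topk)
```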
arXiv Detail & Related papers (2025-04-24T11:19:18Z) - BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding [51.49345400300556]
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. Traditional approaches, such as uniform frame sampling, often inevitably allocate resources to irrelevant content. We introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies.
arXiv Detail & Related papers (2025-03-27T13:18:40Z)
arXiv Detail & Related papers (2025-03-27T13:18:40Z) - The Devil is in Temporal Token: High Quality Video Reasoning Segmentation [68.33080352141653]
Methods for video reasoning segmentation rely heavily on a single special token to represent the object in the video. We propose VRS-HQ, an end-to-end video reasoning segmentation approach. Our results highlight the strong temporal reasoning and segmentation capabilities of our method.
arXiv Detail & Related papers (2025-01-15T03:17:24Z)
arXiv Detail & Related papers (2025-01-15T03:17:24Z) - MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for
Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict frame-level importance scores in both the local and global views of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z)