AdaRD-Key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video Understanding
- URL: http://arxiv.org/abs/2510.02778v1
- Date: Fri, 03 Oct 2025 07:19:34 GMT
- Title: AdaRD-Key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video Understanding
- Authors: Xian Zhang, Zexi Wu, Zinuo Li, Hongming Xu, Luqi Gong, Farid Boussaid, Naoufel Werghi, Mohammed Bennamoun
- Abstract summary: We propose AdaRD-Key, a training-free sampling module for query-driven long-form video understanding. To handle broad queries with weak alignment to the video, AdaRD-Key employs a lightweight relevance-aware gating mechanism. Our pipeline is training-free, computationally efficient (running in real time on a single GPU), and compatible with existing vision-language models.
- Score: 31.685368980481968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding long-form videos remains a significant challenge for vision-language models (VLMs) due to their extensive temporal length and high information density. Most current multimodal large language models (MLLMs) rely on uniform sampling, which often overlooks critical moments, leading to incorrect responses to queries. In parallel, many keyframe selection approaches impose rigid temporal spacing: once a frame is chosen, an exclusion window suppresses adjacent timestamps to reduce redundancy. While effective at limiting overlap, this strategy frequently misses short, fine-grained cues near important events. Other methods instead emphasize visual diversity but neglect query relevance. We propose AdaRD-Key, a training-free keyframe sampling module for query-driven long-form video understanding. AdaRD-Key maximizes a unified Relevance-Diversity Max-Volume (RD-MV) objective, combining a query-conditioned relevance score with a log-determinant diversity component to yield informative yet non-redundant frames. To handle broad queries with weak alignment to the video, AdaRD-Key employs a lightweight relevance-aware gating mechanism; when the relevance distribution indicates weak alignment, the method seamlessly shifts into a diversity-only mode, enhancing coverage without additional supervision. Our pipeline is training-free, computationally efficient (running in real time on a single GPU), and compatible with existing VLMs in a plug-and-play manner. Extensive experiments on LongVideoBench and Video-MME demonstrate state-of-the-art performance, particularly on long-form videos. Code available at https://github.com/Xian867/AdaRD-Key.
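The RD-MV objective lends itself to a simple greedy reading: each step adds the frame whose query-conditioned relevance plus log-determinant (volume) gain is largest, and a gating test drops the relevance term when the query aligns weakly with the whole video. Below is a minimal NumPy sketch under assumed details: CLIP-style normalized embeddings, a hypothetical max-minus-mean gating rule, and naive log-det recomputation rather than incremental updates. The paper's exact scoring and gating may differ.

```python
import numpy as np

def adard_key_sketch(frame_emb, query_emb, k=16, lam=1.0,
                     gate_thresh=0.25, eps=1e-6):
    """Greedy sketch of a Relevance-Diversity Max-Volume objective.

    frame_emb: (n, d) L2-normalized frame features (CLIP-style encoder assumed)
    query_emb: (d,) L2-normalized query feature
    lam, gate_thresh, eps are hypothetical parameters for illustration.
    """
    rel = frame_emb @ query_emb                  # query-conditioned relevance
    # Relevance-aware gating: if the query aligns weakly with the whole video,
    # fall back to a diversity-only objective (assumed threshold rule).
    if rel.max() - rel.mean() < gate_thresh:
        lam = 0.0

    selected, logdet = [], 0.0
    for _ in range(k):
        best_j, best_gain = None, -np.inf
        for j in range(len(frame_emb)):
            if j in selected:
                continue
            S = frame_emb[selected + [j]]
            # log-volume of the candidate set (log-det of its Gram matrix)
            _, ld = np.linalg.slogdet(S @ S.T + eps * np.eye(len(S)))
            gain = lam * rel[j] + (ld - logdet)
            if gain > best_gain:
                best_j, best_gain = j, gain
        S = frame_emb[selected + [best_j]]
        _, logdet = np.linalg.slogdet(S @ S.T + eps * np.eye(len(S)))
        selected.append(best_j)
    return sorted(selected)
```

The log-det term rewards frames that enlarge the volume spanned by the already-selected set, which suppresses near-duplicates softly instead of imposing the hard temporal exclusion window criticized in the abstract.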
Related papers
- FOCUS: Efficient Keyframe Selection for Long Video Understanding [26.44459739499484]
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. FOCUS, Frame-Optimistic Confidence Upperbound Selection, is a model-agnostic selection module that selects frames under a strict token budget. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench.
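The name suggests an optimism-under-uncertainty (UCB-style) allocation of a frame budget. The toy sketch below illustrates only that generic idea; the segment structure, scoring, and bonus term are assumptions, not FOCUS's actual algorithm.

```python
import numpy as np

def ucb_frame_budget(scores_by_segment, budget, c=1.0, rng=None):
    """Toy upper-confidence-bound allocation of a frame budget across segments.

    scores_by_segment: list of 1-D arrays of per-frame relevance, one per segment.
    Returns (segment, frame) indices of the selected frames. Illustrative only.
    """
    rng = rng or np.random.default_rng(0)
    n_seg = len(scores_by_segment)
    pulls, means, picked = np.zeros(n_seg), np.zeros(n_seg), []
    for t in range(1, budget + 1):
        # optimistic estimate: empirical mean + exploration bonus
        ucb = means + c * np.sqrt(np.log(t + 1) / (pulls + 1e-9))
        ucb[pulls == 0] = np.inf                 # try every segment once
        s = int(np.argmax(ucb))
        f = int(rng.integers(len(scores_by_segment[s])))
        picked.append((s, f))
        pulls[s] += 1
        means[s] += (scores_by_segment[s][f] - means[s]) / pulls[s]
    return picked
```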
arXiv Detail & Related papers (2025-10-31T08:41:13Z) - Episodic Memory Representation for Long-form Video Understanding [52.33907540905242]
Large Video Language Models excel at general video understanding but struggle with long-form videos due to context window limits. We introduce Video-EM, a training-free framework inspired by the principles of human memory. Video-EM achieves performance gains of 4-9 percent over respective baselines while utilizing fewer frames.
arXiv Detail & Related papers (2025-08-13T04:33:07Z) - Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders [62.58375366359421]
Long video understanding with Multimodal Large Language Models (MLLMs) remains a challenging problem. Traditional uniform sampling leads to the selection of irrelevant content. Post-training MLLMs on thousands of frames imposes a substantial computational burden. We propose threading keyframes with narratives (Nar-KFC) to facilitate effective and efficient long video perception.
arXiv Detail & Related papers (2025-05-30T03:04:28Z) - Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation [20.67434288227437]
ViLAMP is a hierarchical video-language model that processes hour-long videos at "mixed precision". ViLAMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Notably, ViLAMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU.
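The "mixed precision" analogy is concrete enough to sketch: keyframes keep their full token grids while non-keyframes are collapsed to a compact summary. A minimal PyTorch sketch, with mean pooling standing in for the model's salience-based reduction (an assumption):

```python
import torch

def mixed_precision_tokens(frames_tokens, is_key):
    """Sketch of "mixed-precision" frame handling in the spirit of ViLAMP.

    frames_tokens: (T, N, D) patch tokens for T frames.
    is_key: (T,) bool mask of keyframes.
    Keyframes keep all N tokens; non-keyframes collapse to one pooled token.
    """
    out = []
    for t in range(frames_tokens.shape[0]):
        if is_key[t]:
            out.append(frames_tokens[t])                        # full resolution
        else:
            out.append(frames_tokens[t].mean(0, keepdim=True))  # 1 summary token
    return torch.cat(out, dim=0)                                # (total tokens, D)
```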
arXiv Detail & Related papers (2025-04-03T09:55:09Z) - BIMBA: Selective-Scan Compression for Long-Range Video Question Answering [46.199493246921435]
Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information. We introduce BIMBA, an efficient state-space model to handle long-form videos.
arXiv Detail & Related papers (2025-03-12T17:57:32Z) - STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Adaptive Keyframe Sampling for Long Video Understanding [75.7837692594814]
This paper presents Adaptive Keyframe Sampling (AKS), a simple yet effective plug-and-play module that aims to maximize the useful information carried by a fixed number of video tokens. Experiments on two long video understanding benchmarks validate that AKS improves video QA accuracy by selecting informative keyframes.
arXiv Detail & Related papers (2025-02-28T17:46:29Z) - LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
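The redundancy-removal step can be pictured directly: compare each frame's feature with the most recently kept frame and drop near-duplicates. A minimal sketch, assuming pooled per-frame DINOv2-style embeddings and an assumed similarity threshold:

```python
import torch
import torch.nn.functional as F

def drop_redundant_frames(feats, thresh=0.95):
    """Remove temporally redundant frames by feature similarity.

    feats: (T, D) one feature vector per frame, e.g., pooled DINOv2 embeddings.
    thresh is an assumed cutoff, not LongVU's actual value.
    Returns indices of frames to keep.
    """
    feats = F.normalize(feats, dim=-1)
    keep = [0]
    for t in range(1, feats.shape[0]):
        # compare against the last kept frame, not just the previous one
        if torch.dot(feats[t], feats[keep[-1]]).item() < thresh:
            keep.append(t)
    return keep
```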
arXiv Detail & Related papers (2024-10-22T21:21:37Z) - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
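The cascade can be illustrated with plain top-k selection standing in for MIST's learned, attention-based selection modules (everything below is a simplification under that assumption):

```python
import torch

def cascaded_select(q, seg_feats, region_feats, k_seg=4, k_reg=8):
    """Toy cascade in the spirit of MIST: pick top segments by query affinity,
    then top regions within those segments.

    q: (D,) question embedding
    seg_feats: (S, D) segment features; region_feats: (S, R, D) region features
    """
    seg_scores = seg_feats @ q                       # (S,) segment affinities
    top_seg = seg_scores.topk(k_seg).indices         # segment selection stage
    regions = region_feats[top_seg]                  # (k_seg, R, D)
    reg_scores = regions @ q                         # (k_seg, R) region affinities
    flat = regions.reshape(-1, regions.shape[-1])
    top_reg = reg_scores.reshape(-1).topk(k_reg).indices
    return flat[top_reg]                             # (k_reg, D) selected regions
```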
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.