FOCUS: Efficient Keyframe Selection for Long Video Understanding
- URL: http://arxiv.org/abs/2510.27280v1
- Date: Fri, 31 Oct 2025 08:41:13 GMT
- Title: FOCUS: Efficient Keyframe Selection for Long Video Understanding
- Authors: Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Zhenheng Yang, Yang You
- Abstract summary: Multimodal large language models (MLLMs) represent images and video frames as visual tokens. FOCUS, Frame-Optimistic Confidence Upper-bound Selection, is a training-free, model-agnostic selection module that selects query-relevant frames under a strict token budget. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on the LongVideoBench benchmark.
- Score: 26.44459739499484
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and a Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
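The abstract describes FOCUS as combinatorial pure exploration over short clips treated as bandit arms, scored with empirical means plus a Bernstein confidence radius. Below is a minimal, hedged Python sketch of that two-stage idea, not the authors' implementation: `score_frame` stands in for whatever small vision-language relevance scorer is plugged in, and every name and hyperparameter here is an illustrative assumption.

```python
import math
import random

def bernstein_radius(scores, delta=0.05):
    # Empirical Bernstein confidence radius for the mean of `scores`.
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * var * log_term / n) + 3.0 * log_term / n

def focus_select(frames, score_frame, clip_len=16, budget=32, rounds=200, top_clips=4):
    # Stage 1 (explore): treat short temporal clips as bandit arms and
    # repeatedly sample a frame from the clip with the largest upper
    # confidence bound, so uncertain regions keep getting explored.
    clips = [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]
    samples = [[score_frame(random.choice(c))] for c in clips]  # one pull per arm
    for _ in range(rounds):
        ucb = [sum(s) / len(s) + bernstein_radius(s) for s in samples]
        arm = max(range(len(clips)), key=ucb.__getitem__)
        samples[arm].append(score_frame(random.choice(clips[arm])))
    # Stage 2 (exploit): rank clips by empirical mean score, then take the
    # top-scoring frames within each of the best regions under the budget.
    ranked = sorted(range(len(clips)), key=lambda i: -sum(samples[i]) / len(samples[i]))
    selected = []
    for i in ranked[:top_clips]:
        selected += sorted(clips[i], key=score_frame, reverse=True)[:budget // top_clips]
    return selected[:budget]
```

Stage 2 re-scores frames inside the chosen clips for simplicity; a practical version would cache the scores already collected during exploration.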
Related papers
- Event-Anchored Frame Selection for Effective Long-Video Understanding [67.56884568828508]
Event-Anchored Frame Selection (EFS) is a hierarchical, event-aware pipeline. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs.
arXiv Detail & Related papers (2026-03-01T08:25:37Z) - K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding [38.06179287702453]
K-frames is a novel paradigm for scene-driven selection that preserves temporal continuity. Instead of selecting individual frames, K-frames predicts semantically coherent, query-relevant clips. K-frames provides an effective, interpretable, and plug-and-play solution for keyframe selection at various scales.
arXiv Detail & Related papers (2025-10-14T06:23:22Z) - From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding [43.82717677801915]
Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision-language tasks. Their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. We show that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding.
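As a hedged illustration of this frame-to-clip shift, the sketch below scores frames individually and then returns the contiguous window with the highest total relevance; `frame_scores` is an assumed precomputed per-frame relevance array, not this paper's actual pipeline.

```python
def best_key_clip(frame_scores, clip_len=16):
    """Return (start, end) of the highest-scoring contiguous clip."""
    window = sum(frame_scores[:clip_len])           # score of the first window
    best, best_start = window, 0
    for start in range(1, len(frame_scores) - clip_len + 1):
        # Slide the window one frame: add the new frame, drop the old one.
        window += frame_scores[start + clip_len - 1] - frame_scores[start - 1]
        if window > best:
            best, best_start = window, start
    return best_start, best_start + clip_len

# Example: a relevance spike around frames 40-55 yields that clip.
scores = [0.1] * 40 + [0.9] * 16 + [0.1] * 44
print(best_key_clip(scores))  # -> (40, 56)
```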
arXiv Detail & Related papers (2025-10-02T17:43:01Z) - Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders [62.58375366359421]
Long video understanding with Multimodal Large Language Models (MLLMs) remains a challenging problem. Traditional uniform sampling leads to selection of irrelevant content. Post-training MLLMs on thousands of frames imposes a substantial computational burden. We propose threading keyframes with narratives (Nar-KFC) to facilitate effective and efficient long video perception.
arXiv Detail & Related papers (2025-05-30T03:04:28Z) - BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding [51.49345400300556]
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. Traditional approaches, such as uniform frame sampling, inevitably allocate resources to irrelevant content. We introduce BOLT, a method to BOost Large VLMs without additional Training, through a comprehensive study of frame selection strategies.
arXiv Detail & Related papers (2025-03-27T13:18:40Z) - Adaptive Keyframe Sampling for Long Video Understanding [75.7837692594814]
This paper presents a simple yet effective plug-and-play module named Adaptive Keyframe Sampling (AKS), which aims to maximize the useful information captured by a fixed number of video tokens. Experiments on two long video understanding benchmarks validate that AKS improves video QA accuracy by selecting informative frames.
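One common way to balance relevance against temporal coverage under a fixed frame budget is sketched below, assuming precomputed per-frame relevance scores; this spacing heuristic is an illustration only, not the actual AKS algorithm.

```python
def sample_keyframes(frame_scores, budget=16, min_gap=8):
    """Greedily pick high-scoring frames at least `min_gap` frames apart."""
    order = sorted(range(len(frame_scores)), key=lambda i: -frame_scores[i])
    picked = []
    for i in order:
        if all(abs(i - j) >= min_gap for j in picked):
            picked.append(i)          # keep only frames that add coverage
        if len(picked) == budget:
            break
    return sorted(picked)             # return indices in temporal order
```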
arXiv Detail & Related papers (2025-02-28T17:46:29Z) - The Devil is in Temporal Token: High Quality Video Reasoning Segmentation [68.33080352141653]
Methods for video reasoning segmentation rely heavily on a single special token to represent the object in the video. We propose VRS-HQ, an end-to-end video reasoning segmentation approach. Our results highlight the strong temporal reasoning and segmentation capabilities of our method.
arXiv Detail & Related papers (2025-01-15T03:17:24Z) - Key Frame Extraction with Attention Based Deep Neural Networks [0.0]
We propose a deep learning-based approach for key frame detection using a deep auto-encoder model with an attention layer.
The proposed method first extracts features from the video frames using the encoder part of the auto-encoder, then applies the k-means algorithm to these features to group similar frames together.
The method was evaluated on the TVSUM clustering video dataset and achieved a classification accuracy of 0.77, indicating a higher success rate than many existing methods.
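A minimal sketch of the encode-then-cluster recipe described above, assuming a hypothetical `encoder` callable in place of the paper's attention auto-encoder; the frame nearest each k-means centroid is kept as a key frame.

```python
import numpy as np
from sklearn.cluster import KMeans

def keyframes_by_clustering(frames, encoder, k=5):
    # One feature vector per frame from the (assumed) encoder half.
    feats = np.stack([encoder(f) for f in frames])
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
    keyframes = []
    for center in km.cluster_centers_:
        # Keep the frame whose features lie closest to each cluster center.
        idx = int(np.argmin(np.linalg.norm(feats - center, axis=1)))
        keyframes.append(idx)
    return sorted(set(keyframes))
```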
arXiv Detail & Related papers (2023-06-21T15:09:37Z) - Deep Unsupervised Key Frame Extraction for Efficient Video
Classification [63.25852915237032]
This work presents an unsupervised method to retrieve the key frames, which combines a Convolutional Neural Network (CNN) and Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework; it has two advantages compared with previous works, one being that it can calculate the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
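For illustration only, here is a bare-bones density-peaks step over CNN frame features (omitting TSDPC's temporal-segment machinery and the LSTM classifier): frames whose features have both high local density and a large distance to any denser point act as cluster centers, i.e. key frames. Taking a fixed top-k is a simplification, since the paper's method determines the number of key frames automatically.

```python
import numpy as np

def density_peak_keyframes(feats, cutoff=1.0, k=5):
    # Pairwise distances between frame feature vectors (N x N).
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    rho = (dists < cutoff).sum(axis=1) - 1          # local density (exclude self)
    delta = np.empty(len(feats))
    for i in range(len(feats)):
        denser = np.where(rho > rho[i])[0]          # points with higher density
        delta[i] = dists[i, denser].min() if denser.size else dists[i].max()
    gamma = rho * delta                             # density-peak score
    return np.sort(np.argsort(-gamma)[:k])          # top-k peaks, in time order
```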
arXiv Detail & Related papers (2022-11-12T20:45:35Z) - OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that the efficient video recognition task lies in processing a whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)