K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding
- URL: http://arxiv.org/abs/2510.13891v1
- Date: Tue, 14 Oct 2025 06:23:22 GMT
- Title: K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding
- Authors: Yifeng Yao, Yike Yun, Jing Wang, Huishuai Zhang, Dongyan Zhao, Ke Tian, Zhihao Wang, Minghui Qiu, Tao Wang
- Abstract summary: K-frames is a novel paradigm for scene-driven keyframe selection that preserves temporal continuity. Instead of selecting individual frames, K-frames predicts semantically coherent, query-relevant clips. K-frames provides an effective, interpretable, and plug-and-play solution for keyframe selection at various scales.
- Score: 38.06179287702453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in image understanding, but long-video understanding is constrained by context windows and computational cost, and uniform frame sampling often leads to substantial information loss. Meanwhile, existing keyframe selection methods, such as text-frame retrieval or RL-based frame optimization, typically yield sparse and temporally disjoint frames, overlooking scene continuity and lacking flexibility for multi-scale frame selection. To address these limitations, we introduce K-frames, a novel paradigm for scene-driven keyframe selection that preserves temporal continuity. Instead of selecting individual frames, K-frames predicts semantically coherent, query-relevant clips, which enables any-k keyframe selection to meet diverse user budgets. To support this approach, we first introduce PeakClips, a dataset of 200K query-conditioned video highlights. Building on this dataset, K-frames learns clip2frame selection through a three-stage progressive curriculum: two Supervised Fine-Tuning stages for temporal grounding and key-clip perception, followed by a Reinforcement Learning stage that directly optimizes the scene-driven prediction policy for downstream tasks without further annotations. Extensive experiments on major long-video understanding benchmarks demonstrate that K-frames provides an effective, interpretable, and plug-and-play solution for keyframe selection at various scales. Our dataset and model will be made available.
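The clip2frame step admits a simple budget-splitting view: given query-relevant clips, a user budget of k frames can be spread across clips in proportion to how relevant and how long each clip is. Below is a minimal Python sketch of that idea; the `Clip` structure, the scores, and the allocation rule are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    start: float      # clip start time in seconds
    end: float        # clip end time in seconds
    relevance: float  # query-relevance score in [0, 1]

def any_k_keyframes(clips: list[Clip], k: int) -> list[float]:
    """Spread a user-chosen budget of k keyframes across predicted clips.

    Each clip receives a share of the budget proportional to its
    relevance-weighted duration; frames are then spaced uniformly inside
    each clip, so temporal continuity within a scene is preserved.
    """
    weights = [c.relevance * (c.end - c.start) for c in clips]
    total = sum(weights) or 1.0
    timestamps: list[float] = []
    for clip, w in zip(clips, weights):
        n = max(1, round(k * w / total))          # at least one frame per clip
        step = (clip.end - clip.start) / (n + 1)  # interior, evenly spaced
        timestamps.extend(clip.start + step * (i + 1) for i in range(n))
    return sorted(timestamps)[:k]                 # trim if rounding overshoots

# Example: a budget of k=6 frames over two predicted clips.
clips = [Clip(12.0, 20.0, 0.9), Clip(47.5, 51.0, 0.6)]
print(any_k_keyframes(clips, k=6))
```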
Related papers
- STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative [55.05324155854762]
We introduce STAGE, a SToryboard-Anchored GEneration workflow, to reformulate the multi-shot video generation task. Instead of using sparse frames, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences.
arXiv Detail & Related papers (2025-12-13T15:57:29Z)
- FOCUS: Efficient Keyframe Selection for Long Video Understanding [26.44459739499484]
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. FOCUS, Frame-Optimistic Confidence Upperbound Selection, is a model-agnostic selection module that selects frames under a strict token budget. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on the LongVideoBench benchmark.
arXiv Detail & Related papers (2025-10-31T08:41:13Z)
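The "confidence upperbound" phrasing in the FOCUS summary suggests a bandit-style search over the video. The sketch below applies a generic UCB rule to video segments; the scoring callback, the constant `c`, and the segment granularity are assumptions for illustration, not the paper's algorithm.

```python
import math
import random

def ucb_select_frames(num_segments: int, budget: int, relevance_of, c: float = 2.0) -> list[int]:
    """Spend a frame budget over video segments with a UCB rule.

    `relevance_of(segment)` stands in for a cheap query-frame scorer
    (e.g. similarity of one frame sampled from that segment). Segments
    with high upper-confidence bounds are sampled more often,
    concentrating the budget where query evidence looks strongest.
    """
    counts = [0] * num_segments
    means = [0.0] * num_segments
    picks: list[int] = []
    for t in range(1, budget + 1):
        # Exploit high observed relevance, explore rarely-seen segments.
        ucb = [means[s] + c * math.sqrt(math.log(t + 1) / counts[s])
               if counts[s] else float("inf") for s in range(num_segments)]
        s = ucb.index(max(ucb))
        reward = relevance_of(s)
        counts[s] += 1
        means[s] += (reward - means[s]) / counts[s]  # running mean update
        picks.append(s)                              # one frame drawn per pick
    return picks

# Toy run: segment 3 is the most query-relevant region of the video.
print(ucb_select_frames(8, budget=16,
                        relevance_of=lambda s: random.random() * (1.5 if s == 3 else 1.0)))
```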
- From Captions to Keyframes: KeyScore for Multimodal Frame Scoring and Video-Language Understanding [1.3856027745141806]
KeyScore is a caption-aware frame scoring method that combines semantic similarity to captions, temporal representativeness, and contextual drop impact. Our method enhances both efficiency and performance, enabling scalable and caption-grounded video understanding.
arXiv Detail & Related papers (2025-10-07T23:02:27Z)
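The three per-frame signals named in the KeyScore summary combine naturally as a weighted sum. A minimal sketch follows; the weights are illustrative, not the paper's values.

```python
import numpy as np

def keyscore(sim_to_caption: np.ndarray, representativeness: np.ndarray,
             drop_impact: np.ndarray, weights=(0.5, 0.25, 0.25)) -> np.ndarray:
    """Blend the three per-frame signals into a single keyframe score.

    All inputs have shape (num_frames,). The weights are illustrative;
    frames are then ranked by the combined score and the top-k kept.
    """
    w1, w2, w3 = weights
    return w1 * sim_to_caption + w2 * representativeness + w3 * drop_impact

# Toy example with 5 frames: frame 2 matches the caption best.
sim = np.array([0.2, 0.3, 0.9, 0.4, 0.1])
rep = np.array([0.5, 0.6, 0.7, 0.5, 0.4])
drop = np.array([0.1, 0.2, 0.8, 0.3, 0.1])
top_k = np.argsort(keyscore(sim, rep, drop))[::-1][:2]
print(top_k)  # -> [2 3]
```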
- From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding [43.82717677801915]
Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision-language tasks. Their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. We show that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding.
arXiv Detail & Related papers (2025-10-02T17:43:01Z)
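One simple way to realize key-clip selection is to merge adjacent high-scoring frames into short spans. The sketch below is a rough stand-in under assumed names and thresholds, not the paper's method.

```python
def frames_to_clips(frame_scores: list[float], fps: float = 1.0,
                    threshold: float = 0.5, max_gap: int = 1) -> list[tuple[float, float]]:
    """Merge adjacent high-scoring frames into short, coherent clips.

    Frames whose score clears `threshold` are merged into one clip
    whenever they are at most `max_gap` frames apart; each clip is
    returned as a (start_sec, end_sec) span.
    """
    hits = [i for i, s in enumerate(frame_scores) if s >= threshold]
    clips: list[list[int]] = []
    for i in hits:
        if clips and i - clips[-1][1] <= max_gap:
            clips[-1][1] = i      # extend the current clip
        else:
            clips.append([i, i])  # open a new clip
    return [(s / fps, (e + 1) / fps) for s, e in clips]

scores = [0.1, 0.7, 0.8, 0.2, 0.6, 0.1, 0.1, 0.9, 0.9]
print(frames_to_clips(scores))  # -> [(1.0, 3.0), (4.0, 5.0), (7.0, 9.0)]
```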
- FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning [65.42201665046505]
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail. We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract additional visual evidence on demand.
arXiv Detail & Related papers (2025-09-28T17:59:43Z)
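The FiCOT loop described above can be pictured as a tool-calling loop. The sketch below is hypothetical: `model.step`, the `REQUEST_FRAMES t0 t1` convention, and the stub classes are invented for illustration; the real interface is defined by the paper.

```python
def frame_interleaved_reasoning(model, video, question: str, max_turns: int = 4):
    """Multi-turn loop: the model alternates text reasoning with frame requests."""
    context = [("question", question)]
    for _ in range(max_turns):
        action = model.step(context)                     # reasoning or a tool call
        if action.startswith("REQUEST_FRAMES"):
            _, t0, t1 = action.split()
            frames = video.sample(float(t0), float(t1))  # fetch visual evidence
            context.append(("frames", frames))           # interleave it with text
        else:
            return action                                # final answer
    return model.step(context + [("force_answer", True)])

class EchoModel:
    """Trivial stand-in: requests one clip, then answers."""
    def __init__(self):
        self.asked = False
    def step(self, context):
        if not self.asked:
            self.asked = True
            return "REQUEST_FRAMES 10.0 12.0"
        return "final answer: the door opens around 11s"

class FakeVideo:
    def sample(self, t0, t1):
        return f"<frames {t0}-{t1}>"

print(frame_interleaved_reasoning(EchoModel(), FakeVideo(), "When does the door open?"))
```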
- KFFocus: Highlighting Keyframes for Enhanced Video Understanding [33.69757683688046]
We propose KFFocus, a method designed to efficiently compress video tokens and emphasize the informative context present within video frames. By assigning varying condensation ratios to frames based on their contextual relevance, KFFocus efficiently reduces token redundancy while preserving informative content details. We also introduce a multimodal modeling module that encodes both the temporal relationships between video frames and the spatial structure within each frame.
arXiv Detail & Related papers (2025-08-12T14:57:03Z)
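The varying condensation ratios in KFFocus can be illustrated as a relevance-proportional token budget. This is a toy sketch; the allocation rule, the function name, and the minimum budget are assumptions, not the paper's formulation.

```python
import numpy as np

def condensation_budgets(relevance: np.ndarray, total_tokens: int,
                         min_tokens: int = 4) -> np.ndarray:
    """Give each frame a token budget proportional to its relevance.

    Highly relevant frames keep more visual tokens; background frames are
    condensed aggressively, but every frame keeps at least `min_tokens`
    so no temporal context is dropped entirely (the floor may push the
    total slightly over budget).
    """
    share = relevance / relevance.sum()
    return np.maximum(min_tokens, (share * total_tokens).astype(int))

rel = np.array([0.9, 0.1, 0.3, 0.7])
print(condensation_budgets(rel, total_tokens=200))  # -> [90 10 30 70]
```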
- Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders [62.58375366359421]
Long video understanding with Multimodal Large Language Models (MLLMs) remains a challenging problem. Traditional uniform sampling leads to selection of irrelevant content, while post-training MLLMs on thousands of frames imposes a substantial computational burden. We propose threading keyframes with narratives (Nar-KFC) to facilitate effective and efficient long-video perception.
arXiv Detail & Related papers (2025-05-30T03:04:28Z)
- Adaptive Keyframe Sampling for Long Video Understanding [75.7837692594814]
This paper presents Adaptive Keyframe Sampling (AKS), a simple yet effective plug-and-play module that aims to maximize the useful information carried by a fixed number of video tokens. Experiments on two long-video understanding benchmarks validate that AKS improves video QA accuracy by selecting informative keyframes.
arXiv Detail & Related papers (2025-02-28T17:46:29Z)
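A plug-and-play keyframe sampler of the AKS kind can be sketched as a greedy trade-off between query relevance and temporal coverage. The objective below and its `alpha` weight are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def adaptive_keyframe_sampling(relevance: np.ndarray, k: int,
                               alpha: float = 0.7) -> list[int]:
    """Greedily pick k frames trading off relevance against coverage.

    Each step selects the frame maximizing
        alpha * relevance + (1 - alpha) * distance_to_nearest_pick,
    so chosen frames are both query-relevant and spread over time.
    """
    n = len(relevance)
    picked: list[int] = []
    for _ in range(k):
        if picked:
            dist = np.array([min(abs(i - p) for p in picked) for i in range(n)])
            coverage = dist / max(dist.max(), 1)   # normalize to [0, 1]
        else:
            coverage = np.ones(n)                  # first pick: relevance only
        gain = alpha * relevance + (1 - alpha) * coverage
        gain[picked] = -np.inf                     # never re-pick a frame
        picked.append(int(gain.argmax()))
    return sorted(picked)

rel = np.array([0.1, 0.9, 0.8, 0.1, 0.1, 0.7, 0.1, 0.6])
print(adaptive_keyframe_sampling(rel, k=3))  # -> [1, 5, 7]
```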
- The Devil is in Temporal Token: High Quality Video Reasoning Segmentation [68.33080352141653]
Methods for video reasoning segmentation rely heavily on a single special token to represent the object in the video. We propose VRS-HQ, an end-to-end video reasoning segmentation approach. Our results highlight the strong temporal reasoning and segmentation capabilities of our method.
arXiv Detail & Related papers (2025-01-15T03:17:24Z)
- Semi-supervised and Deep learning Frameworks for Video Classification and Key-frame Identification [1.2335698325757494]
We present two semi-supervised approaches that automatically classify scenes for content and filter frames for scene understanding tasks.
The proposed framework can be scaled to additional video data streams for automated training of perception-driven systems.
arXiv Detail & Related papers (2022-03-25T05:45:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.