KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding
- URL: http://arxiv.org/abs/2512.14017v1
- Date: Tue, 16 Dec 2025 02:27:05 GMT
- Title: KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding
- Authors: Zongyao Li, Kengo Ishida, Satoshi Yamazaki, Xiaotong Ji, Jianquan Liu
- Abstract summary: We propose KFS-Bench, the first benchmark for key frame sampling in long video question answering (QA). KFS-Bench features multi-scene annotations to enable direct and robust evaluation of sampling strategies. Our adaptively balanced sampling approach achieves superior performance in both key frame sampling and QA performance.
- Score: 6.320777997334055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose KFS-Bench, the first benchmark for key frame sampling in long video question answering (QA), featuring multi-scene annotations to enable direct and robust evaluation of sampling strategies. Key frame sampling is crucial for efficient long-form video understanding. In long video QA, selecting informative frames enables multimodal large language models (MLLMs) to improve both accuracy and efficiency. KFS-Bench addresses the limitation of prior works that only indirectly assess frame selection quality via QA accuracy. By providing ground-truth annotations of multiple disjoint scenes required per question, KFS-Bench allows us to directly analyze how different sampling approaches capture essential content across an entire long video. Using KFS-Bench, we conduct a comprehensive study of key frame sampling methods and identify that not only sampling precision but also scene coverage and sampling balance are the key factors influencing QA performance. Regarding all the factors, we design a novel sampling quality metric that correlates with QA accuracy. Furthermore, we develop a novel key frame sampling method that leverages question-video relevance to balance sampling diversity against question-frame similarity, thereby improving coverage of relevant scenes. Our adaptively balanced sampling approach achieves superior performance in both key frame sampling and QA performance. The benchmark is available at https://github.com/NEC-VID/KFS-Bench.
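The abstract names sampling precision and scene coverage as separate factors. A minimal sketch of how the two can diverge is given below. It assumes scenes are annotated as inclusive frame-index intervals; the function names and the simple definitions are illustrative assumptions, not the benchmark's actual annotation format or the paper's sampling quality metric.

```python
from typing import List, Tuple

# Assumption: a ground-truth scene is an inclusive (start, end) frame interval.
Scene = Tuple[int, int]


def in_scene(frame: int, scene: Scene) -> bool:
    """Return True if the frame index falls inside the scene interval."""
    start, end = scene
    return start <= frame <= end


def sampling_precision(sampled: List[int], scenes: List[Scene]) -> float:
    """Fraction of sampled frames that land in any ground-truth scene."""
    if not sampled:
        return 0.0
    hits = sum(1 for f in sampled if any(in_scene(f, s) for s in scenes))
    return hits / len(sampled)


def scene_coverage(sampled: List[int], scenes: List[Scene]) -> float:
    """Fraction of ground-truth scenes hit by at least one sampled frame."""
    if not scenes:
        return 0.0
    covered = sum(1 for s in scenes if any(in_scene(f, s) for f in sampled))
    return covered / len(scenes)


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Baseline: take the centre frame of each of k equal-length segments."""
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


if __name__ == "__main__":
    # Hypothetical question that needs two disjoint scenes of a 1000-frame video.
    scenes = [(100, 149), (700, 759)]
    candidates = {
        "uniform": uniform_sample(1000, 8),
        "one-scene only": [101, 105, 110, 120, 130, 140, 145, 148],
        "both scenes": [110, 130, 710, 740],
    }
    for name, frames in candidates.items():
        p = sampling_precision(frames, scenes)
        c = scene_coverage(frames, scenes)
        print(f"{name}: precision={p:.2f} coverage={c:.2f}")
```

In this toy setup, a sampler that concentrates on a single scene reaches perfect precision while covering only half of the required scenes, which is the kind of gap that evaluating QA accuracy alone would not expose.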
Related papers
- Improving Video Question Answering through query-based frame selection [15.416301612152004]
Video Question Answering (VideoQA) models enhance understanding and interaction with audiovisual content. Due to heavy compute requirements, most large visual language models (VLMs) for VideoQA rely on a fixed number of frames by uniformly sampling the video. We present a novel query-based selection of frames relevant to the questions based on submodular mutual information (SMI) functions.
arXiv Detail & Related papers (2026-01-12T12:10:20Z)
- Q-Save: Towards Scoring and Attribution for Generated Video Evaluation [65.83319736145869]
We present Q-Save, a new benchmark dataset and model for holistic evaluation of AI-generated video (AIGV) quality. The dataset contains nearly 10,000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels. We propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation.
arXiv Detail & Related papers (2025-11-24T07:00:21Z)
- FOCUS: Efficient Keyframe Selection for Long Video Understanding [26.44459739499484]
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. FOCUS, Frame-Optimistic Confidence Upper-bound Selection, is a model-agnostic selection module that selects frames under a strict token budget. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench.
arXiv Detail & Related papers (2025-10-31T08:41:13Z)
- A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering [15.220013605396396]
A.I.R. is a training-free approach for Adaptive, Iterative, and Reasoning-based frame selection. We leverage a powerful Vision-Language Model (VLM) to perform deep, semantic analysis on complex queries. Our approach significantly boosts the performance of the foundation VLM and achieves substantial gains in computational efficiency.
arXiv Detail & Related papers (2025-10-06T01:51:13Z)
- LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning [73.90466023069125]
We propose LOVE-R1, a model that can adaptively zoom in on a video clip. The model is first provided with densely sampled frames at a low resolution. If spatial details are needed, the model can zoom in on a clip of interest at a high frame resolution.
arXiv Detail & Related papers (2025-09-29T13:43:55Z)
- BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding [51.49345400300556]
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. Traditional approaches, such as uniform frame sampling, often inevitably allocate resources to irrelevant content. We introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies.
arXiv Detail & Related papers (2025-03-27T13:18:40Z)
- From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment [51.3011761744484]
Multimodal large language models can only process a finite number of frames in a single inference. We propose multiple predictions through visual context sampling, followed by a scoring mechanism to select the final prediction. Experiments show that this approach covers the correct answer for a high percentage of long video questions.
arXiv Detail & Related papers (2025-03-26T11:53:03Z)
- Adaptive Keyframe Sampling for Long Video Understanding [75.7837692594814]
This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). It inserts a plug-and-play module that aims to maximize the useful information with a fixed number of video tokens. Experiments on two long video understanding benchmarks validate that AKS improves video QA accuracy by selecting informative keyframes.
arXiv Detail & Related papers (2025-02-28T17:46:29Z)
- End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling [43.024232182899354]
We propose VidF4, a novel VideoQA framework equipped with a tailored frame selection strategy for effective and efficient VideoQA.
We propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question on the video.
The experimental results across three widely adopted benchmarks demonstrate that our model consistently outperforms existing VideoQA methods.
arXiv Detail & Related papers (2024-07-21T04:09:37Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that the efficient video recognition task lies in processing a whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.