Frame Sampling Strategies Matter: A Benchmark for small vision language models
- URL: http://arxiv.org/abs/2509.14769v1
- Date: Thu, 18 Sep 2025 09:18:42 GMT
- Title: Frame Sampling Strategies Matter: A Benchmark for small vision language models
- Authors: Marija Brkic, Anas Filali Razzouki, Yannis Tevissen, Khalil Guetari, Mounim A. El Yacoubi
- Abstract summary: We propose the first frame-accurate benchmark of state-of-the-art small vision language models for video question-answering. Our results confirm the suspected bias and highlight both data-specific and task-specific behaviors of SVLMs under different frame-sampling techniques.
- Score: 3.719563722270237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Comparing vision language models on videos is particularly complex, as performance is jointly determined by the model's visual representation capacity and the frame-sampling strategy used to construct the input. Current video benchmarks are suspected to suffer from substantial frame-sampling bias, as models are evaluated with different frame selection strategies. In this work, we propose the first frame-accurate benchmark of state-of-the-art small vision language models (SVLMs) for video question-answering, evaluated under controlled frame-sampling strategies. Our results confirm the suspected bias and highlight both data-specific and task-specific behaviors of SVLMs under different frame-sampling techniques. By open-sourcing our benchmarking code, we provide the community with a reproducible and unbiased protocol for evaluating video VLMs and emphasize the need for standardized frame-sampling strategies tailored to each benchmarking dataset in future research.
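The abstract does not enumerate the strategies it compares, but controlled evaluations of this kind typically fix one simple sampler per run. Below is a minimal Python sketch of three common strategies (uniform, seeded random, and fps-based); the function names are illustrative and not taken from the paper's released code.

```python
import random

def uniform_indices(n_total: int, n_frames: int) -> list[int]:
    """Pick n_frames indices evenly spaced across the video."""
    if n_frames >= n_total:
        return list(range(n_total))
    step = n_total / n_frames
    return [int(step * i + step / 2) for i in range(n_frames)]

def random_indices(n_total: int, n_frames: int, seed: int = 0) -> list[int]:
    """Pick n_frames indices uniformly at random, seeded for reproducibility."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), min(n_frames, n_total)))

def fps_indices(n_total: int, native_fps: float, target_fps: float) -> list[int]:
    """Keep one frame every native_fps / target_fps frames."""
    stride = max(1, round(native_fps / target_fps))
    return list(range(0, n_total, stride))
```

Holding one such strategy fixed across all models in a run is what removes the frame-sampling confound the abstract describes.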
Related papers
- Q-Save: Towards Scoring and Attribution for Generated Video Evaluation [65.83319736145869]
We present Q-Save, a new benchmark dataset and model for holistic evaluation of AI-generated video (AIGV) quality.
The dataset contains nearly 10,000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels.
We propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation.
arXiv Detail & Related papers (2025-11-24T07:00:21Z)
- FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning [65.42201665046505]
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question.
This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail.
We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT).
Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract…
arXiv Detail & Related papers (2025-09-28T17:59:43Z)
- DUAL-VAD: Dual Benchmarks and Anomaly-Focused Sampling for Video Anomaly Detection [8.294763803639391]
Video Anomaly Detection (VAD) is critical for surveillance and public safety.
Existing benchmarks are limited to either frame-level or video-level tasks.
This work introduces a softmax-based frame allocation strategy that prioritizes anomaly-dense segments while maintaining full-video coverage.
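The summary does not give the paper's exact formulation; the following is a minimal sketch of one plausible softmax-based allocation, assuming per-segment anomaly scores are available and the frame budget is at least one frame per segment (names such as `allocate_frames` and `temperature` are hypothetical).

```python
import numpy as np

def allocate_frames(anomaly_scores: np.ndarray, budget: int,
                    temperature: float = 1.0) -> np.ndarray:
    """Distribute a frame budget over video segments via a softmax over
    per-segment anomaly scores; every segment keeps at least one frame,
    which preserves full-video coverage. Assumes budget >= len(scores)."""
    weights = np.exp(anomaly_scores / temperature)
    weights /= weights.sum()
    base = np.ones(len(anomaly_scores), dtype=int)  # coverage guarantee
    extra = budget - len(anomaly_scores)            # frames left to split
    alloc = base + np.floor(weights * extra).astype(int)
    # hand out frames lost to flooring, highest-weight segments first
    for i in np.argsort(-weights)[: budget - alloc.sum()]:
        alloc[i] += 1
    return alloc
```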
arXiv Detail & Related papers (2025-09-15T05:48:22Z)
- Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models [51.67019924750931]
Video-LevelGauge is a benchmark designed to assess positional bias in large video language models (LVLMs).
We employ standardized probes and customized contextual setups, allowing flexible control over context length, probe position, and contextual types.
Our benchmark comprises 438 manually curated videos spanning multiple types, yielding 1,177 high-quality multiple-choice questions and 120 open-ended questions.
arXiv Detail & Related papers (2025-08-27T07:58:16Z)
- ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding [52.050036778325094]
We introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), a novel frame-level policy optimization framework.
ReFoCUS learns a frame selection policy via reinforcement learning, using reward signals derived from a reference LMM to reflect the model's intrinsic preferences for frames.
Our approach consistently improves reasoning performance across multiple video QA benchmarks.
arXiv Detail & Related papers (2025-06-02T03:08:07Z)
- BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding [51.49345400300556]
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks.
Traditional approaches, such as uniform frame sampling, often inevitably allocate resources to irrelevant content.
We introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies.
arXiv Detail & Related papers (2025-03-27T13:18:40Z)
- An Empirical Comparison of Video Frame Sampling Methods for Multi-Modal RAG Retrieval [1.6581184950812533]
We investigate the trade-offs in frame sampling methods for Video & Frame Retrieval using natural language questions.
Our study focuses on the storage and retrieval of image data (video frames) within a vector database, as required by the Video RAG pattern.
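As an illustration of that retrieval step, here is a minimal in-memory sketch using cosine similarity over frame embeddings; a real deployment would store the vectors in a vector database, and these function names are hypothetical.

```python
import numpy as np

def build_index(frame_embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize embeddings so a dot product equals cosine similarity."""
    norms = np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    return frame_embeddings / norms

def retrieve(index: np.ndarray, query_embedding: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k frames most similar to the query embedding."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = index @ q
    return np.argsort(-scores)[:k]
```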
arXiv Detail & Related papers (2024-07-22T11:44:08Z)
- Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z)
- Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models [41.12711820047315]
Video understanding models usually sample a set of frames or clips at random, regardless of the internal correlations between their visual contents or their relevance to the problem.
We propose two frame sampling strategies, namely the most domain frames (MDF) and most implied frames (MIF), to maximally preserve the frames most likely to be vital to the given questions.
arXiv Detail & Related papers (2023-07-09T14:54:30Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that the efficient video recognition task lies in processing a whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
- MGSampler: An Explainable Sampling Strategy for Video Action Recognition [30.516462193231888]
We present an explainable, adaptive, and effective frame sampler, called Motion-guided Sampler (MGSampler).
Our basic motivation is that motion is an important and universal signal that can drive us to select frames from videos adaptively.
Our MGSampler yields a new principled and holistic sampling scheme that can be incorporated into any existing video architecture.
arXiv Detail & Related papers (2021-04-20T13:24:01Z)
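The summary leaves the sampling rule implicit; below is a minimal sketch of the general motion-guided idea, assuming per-frame motion magnitudes (e.g., mean absolute frame difference) are precomputed and sum to a positive value. The names are illustrative, not MGSampler's actual API.

```python
import numpy as np

def motion_guided_indices(motion: np.ndarray, n_frames: int) -> np.ndarray:
    """Sample frame indices so each chosen frame accounts for an equal
    share of the video's total motion (inverse-CDF sampling).

    motion: per-frame motion magnitude with a positive sum.
    """
    cdf = np.cumsum(motion)
    cdf = cdf / cdf[-1]  # normalize cumulative motion to [0, 1]
    # place n_frames targets at the midpoints of equal motion shares
    targets = (np.arange(n_frames) + 0.5) / n_frames
    return np.searchsorted(cdf, targets)
```

High-motion regions receive denser sampling while static stretches are skipped, which is the adaptive behavior the abstract refers to.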