MGSampler: An Explainable Sampling Strategy for Video Action Recognition
- URL: http://arxiv.org/abs/2104.09952v1
- Date: Tue, 20 Apr 2021 13:24:01 GMT
- Title: MGSampler: An Explainable Sampling Strategy for Video Action Recognition
- Authors: Yuan Zhi, Zhan Tong, Limin Wang, Gangshan Wu
- Abstract summary: We present an explainable, adaptive, and effective frame sampler, called Motion-guided Sampler (MGSampler).
Our basic motivation is that motion is an important and universal signal that can drive us to select frames from videos adaptively.
Our MGSampler yields a new principled and holistic sampling scheme that can be incorporated into any existing video architecture.
- Score: 30.516462193231888
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Frame sampling is a fundamental problem in video action recognition due to
the essential redundancy in time and limited computation resources. Existing
sampling strategies often employ a fixed frame selection scheme and lack the
flexibility to deal with complex variations in videos. In this paper, we
present an explainable, adaptive, and effective frame sampler, called
Motion-guided Sampler (MGSampler). Our basic motivation is that motion is an
important and universal signal that can drive us to select frames from videos
adaptively. Accordingly, we propose two important properties in our MGSampler
design: motion sensitive and motion uniform. First, we present two different
motion representations to enable us to efficiently distinguish the motion
salient frames from the background. Then, we devise a motion-uniform sampling
strategy based on the cumulative motion distribution to ensure the sampled
frames evenly cover all the important frames with high motion saliency. Our
MGSampler yields a new principled and holistic sampling scheme that can be
incorporated into any existing video architecture. Experiments on five
benchmarks demonstrate the effectiveness of our MGSampler over previously fixed
sampling strategies, and also its generalization power across different
backbones, video models, and datasets.
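The motion-uniform sampling idea described in the abstract lends itself to a short sketch. The following is a minimal, hypothetical illustration, not the paper's implementation: motion saliency is approximated here by mean absolute frame differences (the paper uses its own motion representations), and frames are picked at evenly spaced quantiles of the cumulative motion distribution, so high-motion segments contribute more selected frames.

```python
# Sketch of motion-uniform frame sampling in the spirit of MGSampler.
# Assumption (not from the paper's code): per-frame motion saliency is
# approximated by the mean absolute difference between consecutive frames.
import numpy as np

def motion_uniform_sample(frames: np.ndarray, num_samples: int) -> list:
    """Select frame indices evenly spaced in cumulative motion.

    frames: array of shape (T, H, W) or (T, H, W, C), with T >= 2.
    Returns num_samples frame indices in [1, T-1].
    """
    # Per-frame motion magnitude: mean |difference| with the previous frame.
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    motion = diffs.reshape(diffs.shape[0], -1).mean(axis=1)  # length T-1
    motion = motion + 1e-8                   # avoid a degenerate flat CDF
    cdf = np.cumsum(motion) / motion.sum()   # cumulative motion distribution
    # Sample at midpoints of num_samples equal-probability bins of the CDF,
    # so segments with more motion receive more selected frames.
    targets = (np.arange(num_samples) + 0.5) / num_samples
    return [int(np.searchsorted(cdf, t)) + 1 for t in targets]

# Toy example: 16 mostly static frames with a burst of motion in the middle.
rng = np.random.default_rng(0)
video = np.zeros((16, 8, 8), dtype=np.float32)
video[8:12] = rng.random((4, 8, 8))          # motion concentrated here
print(motion_uniform_sample(video, 4))       # indices cluster around 8-12
```

On this toy input, a fixed uniform sampler would return frames spread across the whole clip, while the motion-guided variant concentrates its picks inside the moving segment, which is the behavior the abstract's "motion uniform" property is after.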
Related papers
- Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding [43.587729230845525]
Current methods typically select frames with high relevance to a given query.
We introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework.
WFS-SB significantly boosts LVLM performance, improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench.
arXiv Detail & Related papers (2026-02-28T07:18:07Z) - LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning [73.90466023069125]
We propose LOVE-R1, a model that can adaptively zoom in on a video clip.
The model is first provided with densely sampled frames but in a small resolution.
If some spatial details are needed, the model can zoom in on a clip of interest with a large frame resolution.
arXiv Detail & Related papers (2025-09-29T13:43:55Z) - FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning [65.42201665046505]
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question.
This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail.
We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT).
Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract
arXiv Detail & Related papers (2025-09-28T17:59:43Z) - Frame Sampling Strategies Matter: A Benchmark for small vision language models [3.719563722270237]
We propose the first frame-accurate benchmark of state-of-the-art small vision language models for video question-answering.
Our results confirm the suspected bias and highlight both data-specific and task-specific behaviors of SVLMs under different frame-sampling techniques.
arXiv Detail & Related papers (2025-09-18T09:18:42Z) - DUAL-VAD: Dual Benchmarks and Anomaly-Focused Sampling for Video Anomaly Detection [8.294763803639391]
Video Anomaly Detection (VAD) is critical for surveillance and public safety.
Existing benchmarks are limited to either frame-level or video-level tasks.
This work introduces a softmax-based frame allocation strategy that prioritizes anomaly-dense segments while maintaining full-video coverage.
arXiv Detail & Related papers (2025-09-15T05:48:22Z) - An Empirical Comparison of Video Frame Sampling Methods for Multi-Modal RAG Retrieval [1.6581184950812533]
We investigate the trade-offs in frame sampling methods for Video & Frame Retrieval using natural language questions.
Our study focuses on the storage and retrieval of image data (video frames) within a vector database, as required by the Video RAG pattern.
arXiv Detail & Related papers (2024-07-22T11:44:08Z) - End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling [43.024232182899354]
We propose VidF4, a novel VideoQA framework equipped with a tailored frame selection strategy for effective and efficient VideoQA.
We propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question on the video.
The experimental results across three widely adopted benchmarks demonstrate that our model consistently outperforms existing VideoQA methods.
arXiv Detail & Related papers (2024-07-21T04:09:37Z) - Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models [41.12711820047315]
Video understanding models usually sample a set of frames or clips at random, regardless of the internal correlations between their visual contents or their relevance to the question.
We propose two frame sampling strategies, namely the most domain frames (MDF) and most implied frames (MIF), to maximally preserve those frames that are most likely vital to the given questions.
arXiv Detail & Related papers (2023-07-09T14:54:30Z) - VIDM: Video Implicit Diffusion Models [75.90225524502759]
Diffusion models have emerged as a powerful generative method for synthesizing high-quality and diverse set of images.
We propose a video generation method based on diffusion models, where the effects of motion are modeled in an implicit condition.
We improve the quality of the generated videos by proposing multiple strategies such as sampling space truncation, robustness penalty, and positional group normalization.
arXiv Detail & Related papers (2022-12-01T02:58:46Z) - Animation from Blur: Multi-modal Blur Decomposition with Motion Guidance [83.25826307000717]
We study the challenging problem of recovering detailed motion from a single motion-blurred image.
Existing solutions to this problem estimate a single image sequence without considering the motion ambiguity for each region.
In this paper, we explicitly account for such motion ambiguity, allowing us to generate multiple plausible solutions all in sharp detail.
arXiv Detail & Related papers (2022-07-20T18:05:53Z) - Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition [25.888314212797436]
We propose a novel video frame sampler for few-shot action recognition.
Task-specific spatial-temporal frame sampling is achieved via a temporal selector (TS) and a spatial amplifier (SA).
Experiments show a significant boost on various benchmarks including long-term videos.
arXiv Detail & Related papers (2022-07-20T09:04:12Z) - Context-Aware Video Reconstruction for Rolling Shutter Cameras [52.28710992548282]
In this paper, we propose a context-aware GS video reconstruction architecture.
We first estimate the bilateral motion field so that the pixels of the two RS frames are warped to a common GS frame.
Then, a refinement scheme is proposed to guide the GS frame synthesis along with bilateral occlusion masks to produce high-fidelity GS video frames.
arXiv Detail & Related papers (2022-05-25T17:05:47Z) - OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that the efficient video recognition task lies in processing a whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
Full-Duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z) - MuCAN: Multi-Correspondence Aggregation Network for Video Super-Resolution [63.02785017714131]
Video super-resolution (VSR) aims to utilize multiple low-resolution frames to generate a high-resolution prediction for each frame.
Inter- and intra-frames are the key sources for exploiting temporal and spatial information.
We build an effective multi-correspondence aggregation network (MuCAN) for VSR.
arXiv Detail & Related papers (2020-07-23T05:41:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.