Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference
- URL: http://arxiv.org/abs/2510.14624v1
- Date: Thu, 16 Oct 2025 12:34:38 GMT
- Title: Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference
- Authors: Natan Bagrov, Eugene Khvedchenia, Borys Tymchenko, Shay Aharon, Lior Kadoch, Tomer Keren, Ofri Masad, Yonatan Geifman, Ran Zilberstein, Tuomas Rintamaki, Matthieu Le, Andrew Tao,
- Abstract summary: Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches. EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences.
- Score: 5.146388234814547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches -- spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity and requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
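The pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how "unchanged" is measured, so the mean-absolute-difference metric, the `threshold` parameter, the choice to keep every patch of the first frame, and the function names below are all assumptions made for illustration. Kept tokens are returned as (frame index, patch index) pairs so that each surviving token retains its original position.

```python
from typing import List, Tuple

Patch = List[float]   # flattened pixel or feature values of one patch
Frame = List[Patch]   # patches of one frame, in a fixed spatial order


def mean_abs_diff(a: Patch, b: Patch) -> float:
    """Mean absolute difference between two same-sized patches (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_dynamic_tokens(frames: List[Frame], threshold: float) -> List[Tuple[int, int]]:
    """Return (frame_index, patch_index) pairs for the tokens to keep.

    Assumption: every patch of the first frame is kept; a later patch is kept
    only if it differs from the same spatial location in the previous frame
    by more than `threshold`.
    """
    kept = []
    for t, frame in enumerate(frames):
        for p, patch in enumerate(frame):
            if t == 0 or mean_abs_diff(patch, frames[t - 1][p]) > threshold:
                kept.append((t, p))
    return kept


# Toy video: 3 frames, 2 patches each. Patch 0 never changes;
# patch 1 changes once, between frame 0 and frame 1.
frames = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.2, 0.8]],
    [[0.0, 0.0], [0.2, 0.8]],
]
kept = select_dynamic_tokens(frames, threshold=0.1)
print(kept)  # [(0, 0), (0, 1), (1, 1)]
```

In this toy case 3 of 6 tokens survive. Because the retained indices are the original ones, a downstream model could still assign each token the position embedding of its original frame and location, which is one way to read "preserves positional identity".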
Related papers
- Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers [95.68243351895107]
We propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per
arXiv Detail & Related papers (2026-01-21T12:58:52Z)
- Dense Video Understanding with Gated Residual Tokenization [49.17263029080152]
High temporal resolution is essential for capturing fine-grained details in video understanding. Current benchmarks rely mostly on low-frame-rate sampling. Dense Video Understanding (DVU) enables high-FPS video comprehension by reducing both tokenization time and token overhead.
arXiv Detail & Related papers (2025-09-17T17:34:40Z)
- Variation-aware Vision Token Dropping for Faster Large Vision-Language Models [24.952668143243542]
Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency. We propose Variation-aware Vision Token Dropping (i.e., V$^2$Drop), which progressively removes visual tokens with minimal variation during LVLM inference.
arXiv Detail & Related papers (2025-09-01T15:28:44Z)
- PEVLM: Parallel Encoding for Vision-Language Models [4.777805570120456]
We introduce PEVLM, a fine-tuning-free parallel encoding method designed to enhance the prefilling efficiency of Vision-Language Models. PEVLM partitions the input video into context blocks with a shared sink block, while preserving sequential position embeddings to align the attention weight distribution with that of Full-Attention. Experiments demonstrate that PEVLM consistently outperforms existing parallel encoding approaches, achieving up to 7.47x speedup in attention computation and reducing end-to-end latency by 40%.
arXiv Detail & Related papers (2025-06-24T14:14:52Z)
- METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding [55.38256656122857]
We propose METok, a training-free, Multi-stage Event-based Token compression framework. We show METok achieves an optimal trade-off between efficiency and accuracy by dynamically selecting informative visual tokens. For instance, equipping LongVA-7B with METok realizes an 80.6% FLOPs reduction and 93.5% KV Cache memory savings.
arXiv Detail & Related papers (2025-06-03T13:19:41Z)
- Exploiting Temporal State Space Sharing for Video Semantic Segmentation [53.8810901249897]
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. We introduce a Temporal Video State Space Sharing architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool.
arXiv Detail & Related papers (2025-03-26T01:47:42Z)
- STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
- Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving [9.900979396513687]
Multimodal large language models (MLLMs) have demonstrated remarkable potential for enhancing scene understanding in autonomous driving systems.
One major limitation arises from the large number of visual tokens required to capture fine-grained and long-context visual information.
We propose Video Token Sparsification (VTS) to significantly reduce the total number of visual tokens while preserving the most salient information.
arXiv Detail & Related papers (2024-09-16T05:31:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.