Fugu-MT Paper Translation (Abstract): Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

Paper overview: Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

arxiv url: http://arxiv.org/abs/2510.14624v1
Date: Thu, 16 Oct 2025 12:34:38 GMT
Status: Translation complete
Last updated in system: 2025-10-17 21:15:14.85105
Title: Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference
Title (reference translation): Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference
Authors: Natan Bagrov, Eugene Khvedchenia, Borys Tymchenko, Shay Aharon, Lior Kadoch, Tomer Keren, Ofri Masad, Yonatan Geifman, Ran Zilberstein, Tuomas Rintamaki, Matthieu Le, Andrew Tao,
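Abstract summary: Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. This paper introduces Efficient Video Sampling (EVS), a simple plug-and-play method that reduces token redundancy in videos by identifying and pruning temporally static patches. EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences.
Reference score (independently computed attention score): 5.146388234814547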
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches -- spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity and requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
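Abstract (reference translation): Vision-language models (VLMs) have recently been extended from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. This paper introduces Efficient Video Sampling (EVS), a simple plug-and-play method that reduces token redundancy in videos by identifying and pruning temporally static patches, that is, spatial regions that do not change across consecutive frames. EVS preserves positional identity and requires no architectural changes or retraining. EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments show that EVS consistently improves the efficiency-accuracy trade-off, unlocking scalable video-language understanding without sacrificing quality.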
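
The core idea in the abstract, dropping patches that stay unchanged between consecutive frames while remembering where each kept patch came from, can be illustrated with a minimal sketch. This is not the authors' implementation: the change measure (mean absolute pixel difference), the `threshold` value, the grayscale nested-list frame format, and the function names `patch_change` and `select_dynamic_patches` are all assumptions made for illustration. The abstract does not say how EVS measures change or at which stage of the model it prunes.

```python
def patch_change(prev, curr, row, col, patch):
    """Mean absolute pixel difference over one patch of two grayscale frames.

    Assumption: this simple pixel-space measure stands in for whatever
    dissimilarity measure the paper actually uses.
    """
    total = 0.0
    for y in range(row * patch, (row + 1) * patch):
        for x in range(col * patch, (col + 1) * patch):
            total += abs(curr[y][x] - prev[y][x])
    return total / (patch * patch)


def select_dynamic_patches(frames, patch=4, threshold=0.05):
    """Return the (frame, row, col) indices of the patches to keep.

    frames: list of 2D lists (grayscale, values in [0, 1]), all the same size.
    Every patch of the first frame is kept; in later frames a patch is kept
    only if it changed by more than `threshold` (hypothetical value) relative
    to the same patch in the previous frame.
    """
    height, width = len(frames[0]), len(frames[0][0])
    rows, cols = height // patch, width // patch

    # The first frame has no predecessor, so all of its patches are kept.
    kept = [(0, r, c) for r in range(rows) for c in range(cols)]

    for t in range(1, len(frames)):
        for r in range(rows):
            for c in range(cols):
                if patch_change(frames[t - 1], frames[t], r, c, patch) > threshold:
                    # Keep the original grid index so the surviving token
                    # can still be given its original position.
                    kept.append((t, r, c))
    return kept


# Three 8x8 frames: the top-left 4x4 patch changes in frame 1, then stays put.
frames = [[[0.0] * 8 for _ in range(8)] for _ in range(3)]
for t in (1, 2):
    for y in range(4):
        for x in range(4):
            frames[t][y][x] = 1.0

kept = select_dynamic_patches(frames, patch=4, threshold=0.05)
print(len(kept), "of", 3 * 2 * 2, "patches kept")
```

Returning the original `(frame, row, col)` indices rather than a compacted array is one way to read "preserves positional identity": a downstream model could assign each surviving token the position it would have had without pruning. Fewer kept tokens means a shorter input sequence, which is where the reported TTFT reduction would come from.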

Related papers