Fugu-MT Paper Translation (Abstract): SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

Paper summary: SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

arxiv url: http://arxiv.org/abs/2508.16201v1
Date: Fri, 22 Aug 2025 08:23:09 GMT
Status: Translation complete
System update date: 2025-08-25 16:42:36.310274
Title: SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Title (reference translation): SpecVLM: Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Authors: Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li
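Abstract summary: SpecVLM is a training-free speculative decoding framework designed for Vid-LLMs. It prunes up to 90% of video tokens, enabling efficient speculation without sacrificing accuracy. It achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B.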
Reference score (independently computed attention metric): 27.000912841279597
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model's speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens, enabling efficient speculation without sacrificing accuracy. To achieve this, it performs a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B.
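Abstract (reference translation): Video large language models (Vid-LLMs) show strong capabilities in understanding video content. However, their reliance on dense video token representations causes substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and to accelerate the decoding stage of Vid-LLMs losslessly, the authors introduce SpecVLM, a training-free speculative decoding (SD) framework for Vid-LLMs that incorporates staged video token pruning. Building on the new finding that the draft model's speculation has low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens, enabling efficient speculation without sacrificing accuracy. Stage I selects highly informative tokens guided by attention signals from the verifier (target model), and Stage II prunes the remaining redundant tokens in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B.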
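
The two-stage pruning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `two_stage_prune`, the parameters `keep_ratio` and `stage1_share`, and the use of an even stride over the flattened token order as a stand-in for "spatially uniform" pruning are all assumptions made for illustration. The abstract does not state how the token budget is divided between the two stages.

```python
def two_stage_prune(attn_scores, keep_ratio=0.1, stage1_share=0.5):
    """Return sorted indices of the video tokens kept for the draft model.

    attn_scores  : one verifier attention score per video token (hypothetical input)
    keep_ratio   : fraction of tokens to keep; 0.1 corresponds to pruning 90%
    stage1_share : assumed fraction of the kept budget given to Stage I
    """
    n = len(attn_scores)
    budget = max(1, round(n * keep_ratio))
    k1 = min(budget, round(budget * stage1_share))

    # Stage I: keep the k1 tokens with the highest verifier attention.
    ranked = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    stage1 = set(ranked[:k1])

    # Stage II: spend the rest of the budget on the leftover tokens,
    # sampled at an even stride (assumed proxy for spatial uniformity).
    rest = [i for i in range(n) if i not in stage1]
    k2 = min(budget - k1, len(rest))
    stage2 = [rest[(j * len(rest)) // k2] for j in range(k2)] if k2 else []

    return sorted(stage1 | set(stage2))


# Usage: 100 video tokens, five of which receive high verifier attention.
scores = [0.0] * 100
for idx, s in [(3, 0.9), (17, 0.8), (50, 0.7), (51, 0.6), (99, 0.5)]:
    scores[idx] = s
kept = two_stage_prune(scores, keep_ratio=0.1, stage1_share=0.5)
print(len(kept), kept)
```

In this sketch only the draft model would see the reduced token set. The verifier still checks each drafted token against the full video context, which is why speculative decoding can remain lossless even when the draft input is pruned aggressively.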

Related papers