Fugu-MT Paper Translation (Abstract): See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs

Paper overview: See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs

arxiv url: http://arxiv.org/abs/2604.05650v2
Date: Wed, 08 Apr 2026 18:13:56 GMT
Status: Translation complete
Last updated in system: 2026-04-10 14:10:47.881155
Title: See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs
Title (reference translation): See the Forest for the Trees: Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs
Authors: Yicheng Ji, Jun Zhang, Jinpeng Chen, Cong Wang, Lidan Shou, Gang Chen, Huan Li
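Abstract summary: Video Large Language Models (Video-LLMs) excel at video understanding but suffer from high latency during autoregressive generation. LVSpec is the first training-free loosely SD framework tailored for Video-LLMs.
Reference score (independently computed attention score): 25.611056558730127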
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Large Language Models (Video-LLMs) excel in video understanding but suffer from high inference latency during autoregressive generation. Speculative Decoding (SD) mitigates this by applying a draft-and-verify paradigm, yet existing methods are constrained by rigid exact-match rules, severely limiting the acceleration potential. To bridge this gap, we propose LVSpec, the first training-free loosely SD framework tailored for Video-LLMs. Grounded in the insight that generation is governed by sparse visual-relevant anchors (mandating strictness) amidst abundant visual-irrelevant fillers (permitting loose verification), LVSpec employs a lightweight visual-relevant token identification scheme to accurately pinpoint the former. To further maximize acceptance, we augment this with a position-shift tolerant mechanism that effectively salvages positionally mismatched but semantically equivalent tokens. Experiments demonstrate that LVSpec achieves high fidelity and speed: it preserves >99.8% of target performance while accelerating Qwen2.5-VL-32B by 2.70x and LLaVA-OneVision-72B by 2.94x. Notably, it boosts the mean accepted length and speedup ratio by 136% and 35% compared to SOTA training-free SD methods for Video-LLMs.
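Abstract (reference translation): Video Large Language Models (Video-LLMs) excel at video understanding but suffer from high inference latency during autoregressive generation. Speculative Decoding (SD) mitigates this with a draft-and-verify paradigm, but existing methods are constrained by strict exact-match rules, which severely limits their acceleration potential. To bridge this gap, we propose LVSpec, a training-free loosely SD framework tailored for Video-LLMs. LVSpec builds on the insight that generation is governed by sparse visual-relevant anchors (which mandate strictness) and uses a lightweight visual-relevant token identification scheme to pinpoint them accurately. To further maximize acceptance, it adds a position-shift tolerant mechanism that salvages tokens that are positionally mismatched but semantically equivalent. LVSpec preserves more than 99.8% of target performance while accelerating Qwen2.5-VL-32B by 2.70x and LLaVA-OneVision-72B by 2.94x. Notably, it improves the mean accepted length and speedup ratio by 136% and 35%, respectively, compared with SOTA training-free SD methods for Video-LLMs.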
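
Illustrative sketch (not from the paper): the abstract contrasts strict exact-match verification with loose verification that stays strict only on visual-relevant anchors. The Python below is a minimal toy version of that idea under loud assumptions. The function name `accepted_prefix_length`, the top-k membership test for fillers, and the `shift` window are all hypothetical stand-ins chosen for illustration; the abstract does not specify LVSpec's actual acceptance rules or how anchors are identified, so the anchor mask is simply passed in.

```python
from typing import Sequence, Set


def accepted_prefix_length(
    draft: Sequence[int],
    target_top1: Sequence[int],
    target_topk: Sequence[Set[int]],
    is_anchor: Sequence[bool],
    shift: int = 1,
) -> int:
    """Return how many leading draft tokens are accepted.

    Hypothetical rules (assumptions, not LVSpec's actual criteria):
      * an exact match with the target's top-1 token is always accepted;
      * anchor positions accept nothing but an exact match;
      * filler positions also accept a token in the target's top-k set,
        or a token equal to the target's top-1 choice at a position
        within `shift` of the current one.
    """
    n = len(draft)
    for i in range(n):
        if draft[i] == target_top1[i]:
            continue  # exact match
        if is_anchor[i]:
            return i  # strict check failed on an anchor
        if draft[i] in target_topk[i]:
            continue  # assumed loose check: top-k membership
        lo, hi = max(0, i - shift), min(n, i + shift + 1)
        if draft[i] in target_top1[lo:hi]:
            continue  # assumed position-shift tolerance
        return i  # filler rejected by every loose rule
    return n


# Toy comparison: treating every position as an anchor reproduces
# strict exact-match verification.
draft = [5, 7, 9, 11]
target_top1 = [5, 8, 9, 12]
target_topk = [{5, 6}, {7, 8}, {9}, {12, 13}]
strict = accepted_prefix_length(draft, target_top1, target_topk, [True] * 4)
loose = accepted_prefix_length(
    draft, target_top1, target_topk, [False, False, True, True]
)
print(strict, loose)
```

In this toy input the second draft token differs from the target's top-1 choice, so strict verification stops there, while the loose variant accepts it as a filler and continues until the mismatch at the final anchor position.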


Related papers