Fugu-MT Paper Translation (Abstract): LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

Paper overview: LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

arxiv url: http://arxiv.org/abs/2605.17260v2
Date: Sat, 23 May 2026 08:41:12 GMT
Status: Translation complete
Last updated in system: 2026-05-26 16:32:37.565606
Title: LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs
Authors: Jihwan Kim, Nikhil Parthasarathy, Danfeng Qin, Junhwa Hur, Deqing Sun, Bohyung Han, Ming-Hsuan Yang, Boqing Gong
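Abstract summary: LiteFrame is a strong yet highly efficient video encoder backbone for Video Large Language Models. Compared with InternVL3-8B, LiteFrame reduces end-to-end latency by 35% while processing 8$\times$ more frames. The results point to a potential path to unlocking longer-form video understanding under fixed compute budgets.
Reference score (attention score computed independently by this site): 90.77662862634509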
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on "post-hoc" token reduction -- reducing visual tokens after feature extraction to alleviate the LLM's computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong, yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict information-dense, spatio-temporally compressed representations produced by a large teacher vision model, effectively bypassing redundant computation. When coupled with further Language Model Adaptation (LMA), this approach results in a new latency-accuracy Pareto frontier -- compared with InternVL3-8B, LiteFrame provides a 35% reduction in end-to-end latency while processing 8$\times$ more frames and improves average video understanding accuracy across multiple benchmarks. Our results demonstrate a new potential path to unlocking longer-form video understanding under fixed compute budgets.
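The abstract describes Compressed Token Distillation only at a high level, so the following is a minimal numerical sketch of the general idea rather than the authors' method. It assumes, purely for illustration, that the teacher's tokens are compressed by average pooling over 2x2x2 spatio-temporal windows and that the student is trained with a mean-squared-error objective. The function names `compress_teacher_tokens` and `distillation_loss`, the tensor shapes, and the random stand-in for the student output are all hypothetical.

```python
import numpy as np


def compress_teacher_tokens(tokens: np.ndarray, t: int = 2, s: int = 2) -> np.ndarray:
    """Average-pool teacher tokens over t x s x s spatio-temporal windows.

    tokens: array of shape (T, H, W, D) holding per-frame patch features.
    Returns an array of shape (T // t, H // s, W // s, D).
    Average pooling is an assumption made for illustration only.
    """
    T, H, W, D = tokens.shape
    assert T % t == 0 and H % s == 0 and W % s == 0, "dims must divide evenly"
    # Split each of the T, H, W axes into (blocks, within-block) pairs.
    windows = tokens.reshape(T // t, t, H // s, s, W // s, s, D)
    # Average over the three within-block axes.
    return windows.mean(axis=(1, 3, 5))


def distillation_loss(student_tokens: np.ndarray, teacher_compressed: np.ndarray) -> float:
    """Mean squared error between student predictions and compressed targets."""
    return float(np.mean((student_tokens - teacher_compressed) ** 2))


rng = np.random.default_rng(0)
teacher_tokens = rng.normal(size=(8, 16, 16, 32))  # 8 frames, 16x16 patches, dim 32
targets = compress_teacher_tokens(teacher_tokens)  # shape (4, 8, 8, 32)
student_out = rng.normal(size=targets.shape)       # stand-in for a student encoder's output
loss = distillation_loss(student_out, targets)
print(targets.shape, loss)
```

In this toy setting the student would emit 8 times fewer tokens than the teacher computes, which is the sense in which predicting compressed targets directly could avoid redundant per-frame computation. The actual compression scheme and loss used for LiteFrame are not specified in the abstract.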

