Fugu-MT Paper Translation (Summary): D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition

Paper summary: D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition

arxiv url: http://arxiv.org/abs/2510.08818v1
Date: Thu, 09 Oct 2025 21:08:32 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:47.79575
Title: D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition
Title (reference translation): D-CoDe: Adapting Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition
Authors: Yiyang Huang, Yizhou Wang, Yun Fu
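Abstract summary: Video large language models (Vid-LLMs) excel at diverse video-language tasks. D-CoDe is a training-free adaptation framework that incorporates dynamic compression and question decomposition. Experiments show that D-CoDe effectively improves video understanding across various benchmarks.
Reference score (independently computed attention score): 36.19028662042685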
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks. Code is available at https://github.com/hukcc/D-CoDe.
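Abstract (reference translation): Video large language models (Vid-LLMs), which excel at diverse video-language tasks, can be built effectively by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains difficult because it requires processing dense, temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these problems, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, focusing the model on distinct aspects of the video and enabling more comprehensive understanding. Experiments show that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on a challenging long-video benchmark highlights D-CoDe's potential for handling complex video-language tasks. Code is available at https://github.com/hukcc/D-CoDe.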
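Note: the abstract describes dynamic compression only at a high level (adaptive selection of representative frames, content-aware aggregation of spatial tokens). The sketch below is one plausible way to illustrate that idea and is not taken from the paper or its repository. The function names, the cosine-similarity criterion, and the 0.9 threshold are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_frames(features, threshold=0.9):
    """Hypothetical frame selection: keep a frame only when it differs
    enough from the most recently kept frame (assumed criterion)."""
    if not features:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(features)):
        if cosine(features[i], features[kept[-1]]) < threshold:
            kept.append(i)
    return kept


def merge_tokens(tokens, threshold=0.9):
    """Hypothetical token aggregation: average each run of consecutive
    tokens that are similar to the first token of the run."""
    groups = []
    for tok in tokens:
        if groups and cosine(tok, groups[-1][0]) >= threshold:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    # Element-wise mean of every group.
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]
```

With per-frame feature vectors as input, `select_frames` returns the indices of the retained frames, and `merge_tokens` can then shrink each retained frame's token list. The actual selection and aggregation rules used by D-CoDe may differ; see the code linked in the abstract.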

Related papers