Fugu-MT Paper Translation (Abstract): Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

Paper abstract: Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

arxiv url: http://arxiv.org/abs/2605.22078v1
Date: Thu, 21 May 2026 07:16:45 GMT
Status: Translation complete
Last updated in system: 2026-05-22 16:35:42.132592
Title: Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding
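Title (reference translation): Enhancing Visual Token Representations of Video Large Language Models via Training-Free Spatio-Temporal Pooling and Gridding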
Authors: Bingjun Luo, Tony Wang, Hanqi Chen, Xinpeng Ding
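Abstract summary: We propose ST-GridPool, a training-free visual token enhancement method designed specifically for Video Large Language Models. The approach integrates Pyramid Temporal Gridding (PTG), which captures temporal interactions through hierarchical temporal gridding. The method offers an efficient, plug-and-play solution for improving visual token representations.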
Reference score (independently computed attention score): 11.95854121762109
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved video understanding, yet challenges remain in efficiently compressing visual tokens while preserving spatiotemporal interactions. Existing methods, such as the LLaVA family, rely on simplistic pooling or interpolation techniques that overlook the intricate dynamics of visual tokens. To bridge this gap, we propose ST-GridPool, a novel training-free visual token enhancement method designed specifically for Video LLMs. Our approach integrates Pyramid Temporal Gridding (PTG), which captures multi-grained spatiotemporal interactions through hierarchical temporal gridding, and Norm-based Spatial Pooling (NSP), which preserves high-information visual regions by leveraging the correlation between token norms and semantic richness. Extensive experiments on various benchmarks demonstrate that ST-GridPool consistently enhances the performance of Video LLMs without requiring costly retraining. Our method offers an efficient, plug-and-play solution for improving visual token representations. Our code is available at https://github.com/bingjunluo/ST-GridPool.
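Abstract (reference translation): Recent progress in Multimodal Large Language Models (MLLMs) has substantially advanced video understanding tasks, but efficiently compressing visual tokens while preserving spatiotemporal interactions remains a challenge. Existing methods such as the LLaVA family use simple pooling or interpolation techniques that overlook the complex dynamics of visual tokens. To fill this gap, we propose ST-GridPool, a new training-free visual token enhancement method designed for Video LLMs. The proposed method integrates Pyramid Temporal Gridding (PTG), which captures multi-grained spatiotemporal interactions through hierarchical temporal gridding, and Norm-based Spatial Pooling (NSP), which preserves high-information visual regions by exploiting the correlation between token norms and semantic richness. Experiments on a range of benchmarks show that ST-GridPool consistently improves the performance of Video LLMs without costly retraining. The method provides an efficient, plug-and-play solution for improving visual token representations. The code is available at https://github.com/bingjunluo/ST-GridPool.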
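
The abstract says that NSP uses token norms as a signal of semantic richness but does not say how the pooling is computed. The sketch below is one possible reading, not the authors' implementation: a non-overlapping spatial window in which each token is weighted by its L2 norm, so high-norm tokens dominate the pooled result. The function names `l2_norm` and `norm_weighted_pool`, the window size of 2, and the fallback to a plain mean for all-zero windows are all assumptions made for illustration.

```python
import math


def l2_norm(vec):
    """Return the Euclidean (L2) norm of a token vector."""
    return math.sqrt(sum(x * x for x in vec))


def norm_weighted_pool(grid, window=2):
    """Pool an H x W grid of token vectors with non-overlapping windows.

    Hypothetical illustration only: each token in a window is weighted by
    its L2 norm divided by the sum of norms in that window. This is an
    assumed formulation, not the method published with ST-GridPool.
    """
    height, width = len(grid), len(grid[0])
    dim = len(grid[0][0])
    pooled = []
    for top in range(0, height, window):
        row = []
        for left in range(0, width, window):
            # Collect the tokens that fall inside this window.
            tokens = [
                grid[r][c]
                for r in range(top, min(top + window, height))
                for c in range(left, min(left + window, width))
            ]
            norms = [l2_norm(t) for t in tokens]
            total = sum(norms)
            if total == 0.0:
                # All-zero window: fall back to a plain mean (assumption).
                weights = [1.0 / len(tokens)] * len(tokens)
            else:
                weights = [n / total for n in norms]
            row.append(
                [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
            )
        pooled.append(row)
    return pooled
```

With a 2x2 window holding the tokens `[1, 0]`, `[3, 0]`, `[0, 0]`, `[0, 0]`, plain average pooling gives a first component of 1.0, while this norm-weighted variant gives 2.5, because the higher-norm token receives three quarters of the weight.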

Related papers