Fugu-MT Paper Translation (Abstract): ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling

Paper abstract: ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling

arxiv url: http://arxiv.org/abs/2603.22911v1
Date: Tue, 24 Mar 2026 08:01:16 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.370308
Title: ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
Title (reference translation): ForestPrune: High-ratio visual token compression for video multimodal large language models via spatial-temporal forest modeling
Authors: Shaobo Ju, Baiyang Song, Tao Chen, Jiapeng Zhang, Qiong Wu, Chao Chang, HuaiXi Wang, Yiyi Zhou, Rongrong Ji
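Abstract summary: This work proposes ForestPrune, a novel, training-free token pruning method for video MLLMs. ForestPrune achieves effective, high-ratio pruning via spatial-temporal forest modeling. In practice, ForestPrune constructs token forests across video frames based on semantic, spatial, and temporal constraints.
Reference score (independently computed attention score): 58.993082360672645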
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Because it greatly reduces computation and memory overhead, token compression has become a research hotspot for MLLMs and has achieved remarkable progress on image-language tasks. For video, however, existing methods still fall short of high-ratio token compression. We attribute this shortcoming to insufficient modeling of temporal and continuous video content, and propose a novel, training-free token pruning method for video MLLMs, termed ForestPrune, which achieves effective, high-ratio pruning via Spatial-temporal Forest Modeling. In practice, ForestPrune constructs token forests across video frames based on semantic, spatial, and temporal constraints, yielding an overall comprehension of the video. ForestPrune then evaluates the importance of token trees and nodes based on tree depth and node roles, thereby obtaining a globally optimal pruning decision. To validate ForestPrune, we apply it to two representative video MLLMs, namely LLaVA-Video and LLaVA-OneVision, and conduct extensive experiments on a range of video benchmarks. The experimental results show not only its effectiveness for video MLLMs, e.g., retaining 95.8% average accuracy while removing 90% of the tokens for LLaVA-OneVision, but also its superior performance and efficiency over the compared token compression methods, e.g., +10.1% accuracy on MLVU and 81.4% less pruning time than FrameFusion on LLaVA-Video.
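Abstract (reference translation): Thanks to large savings in computation and memory overhead, token compression has become a research hotspot for MLLMs and has made remarkable progress on image-language tasks. For video, however, existing methods have fallen short of high-ratio token compression. Attributing this to insufficient modeling of temporal and continuous video content, this work proposes a novel, training-free token pruning method for video MLLMs that achieves effective, high-ratio pruning via spatial-temporal forest modeling. In practice, ForestPrune constructs token forests across video frames based on semantic, spatial, and temporal constraints. ForestPrune then evaluates the importance of token trees and nodes based on tree depth and node roles, obtaining a globally optimal pruning decision. To validate ForestPrune, it is applied to two representative video MLLMs, LLaVA-Video and LLaVA-OneVision, with extensive experiments on many video benchmarks. It retains 95.8% average accuracy while removing 90% of the tokens for LLaVA-OneVision, and it shows better performance and efficiency than the compared token compression methods, with +10.1% accuracy on MLVU and 81.4% less pruning time than FrameFusion on LLaVA-Video.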
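The abstract names the two stages of the method (forest construction, then importance scoring by tree depth and node role) but gives no formulas. The following is only a minimal Python sketch of that general idea, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names `build_forest` and `prune`, the cosine-similarity threshold `sim_thresh`, the rule that a token may only link to the token at the same grid position in the previous frame, and the scoring rule that ranks tree roots first (deeper trees before shallower ones).

```python
import numpy as np

def build_forest(feats, sim_thresh=0.8):
    """Link tokens across frames into trees (hypothetical linking rule).

    feats: (T, N, D) array of T frames, N tokens per frame, D features.
    Returns a (T, N) int array holding the flat index of each token's
    parent, or -1 if the token starts a new tree (is a root).
    """
    T, N, _ = feats.shape
    unit = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    parent = np.full((T, N), -1, dtype=np.int64)
    # Assumed semantic constraint: cosine similarity above a threshold.
    # Assumed spatial constraint: same token position.
    # Assumed temporal constraint: adjacent frames only.
    sim = np.sum(unit[1:] * unit[:-1], axis=-1)            # (T-1, N)
    prev_flat = np.arange((T - 1) * N).reshape(T - 1, N)   # index of (t-1, n)
    parent[1:] = np.where(sim >= sim_thresh, prev_flat, -1)
    return parent

def prune(parent, keep_ratio=0.1):
    """Select tokens to keep under a global budget (hypothetical scoring).

    Roots are ranked above all other nodes, and roots of deeper trees are
    ranked above roots of shallower ones. Returns sorted flat indices.
    """
    flat_parent = parent.reshape(-1)
    total = flat_parent.size
    root = np.arange(total)
    depth = np.zeros(total, dtype=np.int64)
    # Parents always have a smaller flat index than their children,
    # so a single forward pass resolves roots and depths.
    for i in range(total):
        p = flat_parent[i]
        if p >= 0:
            root[i] = root[p]
            depth[i] = depth[p] + 1
    # Depth of a tree = largest node depth found in it.
    tree_depth = np.zeros(total, dtype=np.int64)
    np.maximum.at(tree_depth, root, depth)
    is_root = flat_parent < 0
    score = np.where(is_root, tree_depth[root] + total, -depth)
    budget = max(1, int(round(keep_ratio * total)))
    keep = np.argsort(-score, kind="stable")[:budget]
    return np.sort(keep)
```

With `keep_ratio=0.1` the sketch keeps 10% of the tokens, matching the 90% reduction setting quoted in the abstract. A real implementation would presumably search a spatial neighborhood rather than a single position and use a richer notion of node role.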

Related papers