Fugu-MT Paper Translation (Summary): ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

Paper summary: ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

arxiv url: http://arxiv.org/abs/2605.22158v1
Date: Thu, 21 May 2026 08:27:15 GMT
Status: Translation complete
System update date: 2026-05-22 16:35:42.16454
Title: ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs
Title (reference translation): ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding Using MLLMs
Authors: Bingjun Luo, Tony Wang, Chaoqi Chen, Xinpeng Ding
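Abstract summary: To handle complex relationships in video, the authors developed a training-free framework named ST-SimDiff. The proposed method significantly outperforms state-of-the-art approaches while substantially reducing computational cost.
Reference score (independently computed attention score): 20.712141528369553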
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) face significant computational overhead when processing long videos due to the massive number of visual tokens required. To improve efficiency, existing methods primarily reduce redundancy by pruning or merging tokens based on importance or similarity. However, these approaches largely overlook a critical dimension of video content, i.e., changes and turning points, and they lack a collaborative model for spatio-temporal relationships. To address this, we propose a new perspective: similarity is for identifying redundancy, while difference is for capturing key events. Based on this, we designed a training-free framework named ST-SimDiff. We first construct a spatio-temporal graph from the visual tokens to uniformly model their complex associations. Subsequently, we employ a parallel dual-selection strategy: 1) similarity-based selection uses community detection to retain representative tokens, compressing static information; 2) temporal difference-based selection precisely locates content-changing points to preserve tokens that capture key dynamic shifts. This allows the framework to preserve both static and dynamic content with a minimal number of tokens. Extensive experiments show our method significantly outperforms state-of-the-art approaches while substantially reducing computational costs. Our code is available at https://github.com/bingjunluo/ST-SimDiff.
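Abstract (reference translation): Multimodal Large Language Models (MLLMs) face computational overhead when processing long videos, which require a large number of visual tokens. To improve efficiency, existing methods mainly reduce redundancy by pruning or merging tokens based on importance or similarity. However, these approaches overlook a critical dimension of video content, namely changes and turning points, and lack a collaborative model of spatio-temporal relationships. Similarity serves to identify redundancy, while difference serves to capture key events. We therefore designed a training-free framework named ST-SimDiff. First, we construct a spatio-temporal graph from the visual tokens to model their complex associations uniformly. We then adopt a parallel dual-selection strategy: 1) similarity-based selection uses community detection to retain representative tokens and compress static information; 2) temporal difference-based selection precisely locates content-changing points to preserve tokens that capture key dynamic shifts. This preserves both static and dynamic content with a minimal number of tokens. Extensive experiments show that the proposed method significantly outperforms state-of-the-art approaches while substantially reducing computational cost. Our code is available at https://github.com/bingjunluo/ST-SimDiff.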
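
Illustrative sketch (not the authors' implementation): the dual-selection idea in the abstract can be approximated in a few lines of NumPy. Everything specific here is an assumption made for illustration: the (T, N, D) token layout, the cosine-similarity thresholds, the function name select_tokens, and the use of connected components (via union-find) as a simple stand-in for the community detection the paper mentions. The actual graph construction and selection rules may differ; see the linked repository for the real code.

```python
import numpy as np


def _normalize(x):
    # Unit-normalize along the feature axis so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)


def _find(parent, i):
    # Union-find root lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def select_tokens(tokens, sim_thresh=0.9, diff_thresh=0.5):
    """Hypothetical helper. tokens: (T, N, D) array of T frames with N tokens each.

    Returns sorted flat indices (t * N + n) of the tokens to keep.
    Assumes sim_thresh > 0.
    """
    T, N, _ = tokens.shape
    z = _normalize(tokens.astype(np.float64))
    parent = list(range(T * N))

    def union(a, b):
        # Always attach the larger root to the smaller one, so each
        # component's root is its smallest flat index.
        ra, rb = _find(parent, a), _find(parent, b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Spatial edges: similar token pairs inside the same frame.
    for t in range(T):
        sim = z[t] @ z[t].T
        for i, j in zip(*np.where(np.triu(sim, k=1) > sim_thresh)):
            union(t * N + int(i), t * N + int(j))

    # Temporal edges: same token position in adjacent frames.
    temporal_sim = np.sum(z[1:] * z[:-1], axis=-1)  # shape (T - 1, N)
    for t, n in zip(*np.where(temporal_sim > sim_thresh)):
        union(int(t) * N + int(n), (int(t) + 1) * N + int(n))

    # Similarity-based selection: one representative per connected component
    # (a stand-in for community detection).
    reps = {_find(parent, i) for i in range(T * N)}

    # Difference-based selection: tokens whose cosine distance from the
    # previous frame at the same position exceeds diff_thresh.
    changed = {
        (int(t) + 1) * N + int(n)
        for t, n in zip(*np.where(1.0 - temporal_sim > diff_thresh))
    }

    return sorted(reps | changed)
```

In this sketch, static content collapses into one retained token per similarity cluster, while any token that changes sharply between consecutive frames is kept regardless of which cluster it falls in. The union of the two sets is what gets passed on to the language model.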

Related papers