Fugu-MT paper translation (abstract): Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding

Paper abstract: Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding

arxiv url: http://arxiv.org/abs/2509.23672v1
Date: Sun, 28 Sep 2025 06:24:57 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.365163
Title: Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding
Authors: Xixi Jiang, Chen Yang, Dong Zhang, Pingcheng Dong, Xin Yang, Kwang-Ting Cheng
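Abstract summary: This paper proposes a spatiotemporal information mining token merging (STIM-TM) method. STIM-TM introduces a decoupled strategy that reduces token redundancy along the temporal and spatial dimensions independently. Operating in a training-free manner, STIM-TM reduces GFLOPs by over 65% while preserving competitive accuracy across comprehensive surgical video tasks.
Reference score (independently computed attention metric): 32.4892900455388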
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision Transformer models have shown impressive effectiveness in surgical video understanding tasks through long-range dependency modeling. However, current methods suffer from prohibitive computational costs because they process massive numbers of spatiotemporal tokens across video frames. While prior work on token merging has advanced model efficiency, it fails to adequately consider the inherent spatiotemporal structure of video data and overlooks the heterogeneous nature of information distribution, leading to suboptimal performance. In this paper, we propose a spatiotemporal information mining token merging (STIM-TM) method, representing the first dedicated approach for surgical video understanding. STIM-TM introduces a decoupled strategy that reduces token redundancy along temporal and spatial dimensions independently. Specifically, the temporal component merges spatially corresponding tokens from consecutive frames using saliency weighting, preserving critical sequential information and maintaining continuity. Meanwhile, the spatial component prioritizes merging static tokens through temporal stability analysis, protecting dynamic regions containing essential surgical information. Operating in a training-free manner, STIM-TM achieves significant efficiency gains with over 65% GFLOPs reduction while preserving competitive accuracy across comprehensive surgical video tasks. Our method also supports efficient training on long-sequence surgical videos, addressing computational bottlenecks in surgical applications.
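The decoupled temporal and spatial merging described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the function names `temporal_merge` and `spatial_merge`, the use of strictly positive per-token saliency scores, the frame-to-frame L2 change as the stability measure, and the cosine-similarity choice of merge target are all assumptions made for illustration only.

```python
import numpy as np

def temporal_merge(tokens, saliency):
    """Merge spatially corresponding tokens of consecutive frame pairs.

    tokens:   (T, N, D) float array: T frames, N tokens per frame, D channels.
    saliency: (T, N) strictly positive weights (hypothetical, e.g. attention scores).
    Returns merged tokens and their summed saliency; an unpaired last frame is kept.
    """
    T = tokens.shape[0]
    pairs = T // 2
    a, b = tokens[0:2 * pairs:2], tokens[1:2 * pairs:2]        # (pairs, N, D)
    wa, wb = saliency[0:2 * pairs:2], saliency[1:2 * pairs:2]  # (pairs, N)
    total = wa + wb
    # Saliency-weighted average of the two tokens at the same spatial position.
    merged = (a * wa[..., None] + b * wb[..., None]) / total[..., None]
    if T % 2 == 1:
        merged = np.concatenate([merged, tokens[-1:]], axis=0)
        total = np.concatenate([total, saliency[-1:]], axis=0)
    return merged, total

def spatial_merge(tokens, r):
    """Merge the r most temporally static positions into similar kept tokens.

    Stability is measured here (an assumption) as the mean L2 change of a
    position between consecutive frames; low change means "static".
    Returns (T, N - r, D) tokens and the indices of the kept positions.
    """
    T, N, D = tokens.shape
    if not 0 <= r < N:
        raise ValueError("r must satisfy 0 <= r < N")
    if T > 1:
        motion = np.linalg.norm(np.diff(tokens, axis=0), axis=-1).mean(axis=0)
    else:
        motion = np.zeros(N)
    order = np.argsort(motion, kind="stable")   # most static positions first
    static_idx, keep_idx = np.sort(order[:r]), np.sort(order[r:])
    out = np.empty((T, N - r, D), dtype=tokens.dtype)
    for t in range(T):
        kept, static = tokens[t, keep_idx], tokens[t, static_idx]
        kn = kept / (np.linalg.norm(kept, axis=-1, keepdims=True) + 1e-8)
        sn = static / (np.linalg.norm(static, axis=-1, keepdims=True) + 1e-8)
        target = (sn @ kn.T).argmax(axis=1)     # most similar kept token
        sums, counts = kept.copy(), np.ones(len(keep_idx))
        np.add.at(sums, target, static)         # accumulate merged tokens
        np.add.at(counts, target, 1.0)
        out[t] = sums / counts[:, None]         # average over merged members
    return out, keep_idx

# Toy usage: 4 frames, 6 tokens per frame, 8 channels, uniform saliency.
rng = np.random.default_rng(0)
clip = rng.normal(size=(4, 6, 8))
merged_t, _ = temporal_merge(clip, np.ones((4, 6)))
merged_ts, kept_positions = spatial_merge(merged_t, r=2)
print(merged_t.shape, merged_ts.shape)
```

In this toy run the token grid shrinks from (4, 6, 8) to (2, 6, 8) after the temporal step and to (2, 4, 8) after the spatial step. Because nothing is learned, the sketch shares the training-free character the abstract describes, but how the real method scores saliency and stability is not specified in the abstract.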

Related papers