Fugu-MT Paper Translation (Abstract): StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models

Paper overview: StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models

arxiv url: http://arxiv.org/abs/2603.07307v1
Date: Sat, 07 Mar 2026 18:30:58 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:14.219607
Title: StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models
Title (reference translation): StructSAM: Structure- and Spectrum-Preserving Token Merging for Segmentation Models
Authors: Duy M. H. Nguyen, Tuan A. Tran, Duong Nguyen, Siwei Xie, Trung Q. Nguyen, Mai T. N. Truong, Daniel Palenicek, An T. Le, Michael Barz, TrungTin Nguyen, Tuan Dam, Ngan Le, Minh Vu, Khoa Doan, Vien Ngo, Pengtao Xie, James Zou, Daniel Sonntag, Jan Peters, Mathias Niepert
Abstract summary: StructSAM is a resolution-preserving merge-unmerge framework tailored to the Segment Anything Model (SAM). It reduces encoder FLOPs by 25-30% with only minor drops in mIoU/Dice. A spectral graph coarsening view further shows that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines.
Reference score (attention score computed independently by this site): 57.674757786328236
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent token merging techniques for Vision Transformers (ViTs) provide substantial speedups by reducing the number of tokens processed by self-attention, often without retraining. However, their direct application to the Segment Anything Model (SAM) family is nontrivial: SAM's image encoder mixes windowed and global attention, and its mask decoder relies on dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting, and find that existing destination-selection heuristics can erode boundaries and leak prompt information as merge rates increase. We propose StructSAM, a resolution-preserving merge-unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary and prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery. We further provide a spectral graph coarsening view showing that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines. Across eight natural and medical benchmarks, StructSAM reduces encoder FLOPs by 25-30% (up to 40%+ with prompt-aware merging) with minor drops in mIoU/Dice, consistently outperforming ToMe, PiToMe, ToMeSD, VidToMe, and ALGM at the same compute.
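Abstract (reference translation): Recent token merging techniques for Vision Transformers (ViTs) reduce the number of tokens processed by self-attention and deliver substantial speedups, often without retraining. Applying them directly to the Segment Anything Model (SAM) is not straightforward, however: SAM's image encoder mixes windowed and global attention, and its mask decoder depends on dense, prompt-conditioned features to predict boundaries precisely. We systematically evaluate token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting and find that existing destination-selection heuristics can erode boundaries and leak prompt information as the merge rate increases. We propose StructSAM, a resolution-preserving merge-unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary and prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery. We also provide a spectral graph coarsening view showing that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines. Across eight natural and medical benchmarks, StructSAM reduces encoder FLOPs by 25-30% (up to 40% or more with prompt-aware merging) with only minor drops in mIoU/Dice, consistently outperforming ToMe, PiToMe, ToMeSD, VidToMe, and ALGM at the same compute.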
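The merge-unmerge idea described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the energy definition (L2 norm of forward finite differences), the cell size `cell`, the threshold `tau`, and the function names `token_energy`, `merge`, and `unmerge` are all assumptions made for illustration. Flat cells are averaged into a single token here; the paper's low-energy destination selection and prompt-aware screening are not modeled.

```python
import numpy as np

def token_energy(feats):
    """Per-token energy from first-order differences (assumed form, not from the paper)."""
    dy = np.diff(feats, axis=0, append=feats[-1:])     # vertical forward difference
    dx = np.diff(feats, axis=1, append=feats[:, -1:])  # horizontal forward difference
    return np.sqrt((dx ** 2 + dy ** 2).sum(axis=-1))   # shape (H, W)

def merge(feats, cell=2, tau=0.1):
    """Merge every flat cell of the grid into one token; keep all other tokens."""
    H, W, C = feats.shape
    energy = token_energy(feats)
    tokens = []                                  # reduced token list
    assign = np.empty((H, W), dtype=np.int64)    # original position -> reduced token index
    for i in range(0, H, cell):
        for j in range(0, W, cell):
            sl = (slice(i, i + cell), slice(j, j + cell))
            flat = feats[sl].reshape(-1, C)
            if energy[sl].max() < tau:
                # Flat cell: all its tokens share one averaged token.
                tokens.append(flat.mean(axis=0))
                assign[sl] = len(tokens) - 1
            else:
                # High-energy (boundary-like) cell: every token is kept as-is.
                ids = np.arange(len(tokens), len(tokens) + len(flat))
                tokens.extend(flat)
                assign[sl] = ids.reshape(feats[sl].shape[:2])
    return np.stack(tokens), assign

def unmerge(tokens, assign):
    """Token recovery: copy each reduced token back to every position it represents."""
    return tokens[assign]                        # shape (H, W, C)

# Toy 4x4 single-channel feature map with a vertical edge between columns 1 and 2.
feats = np.zeros((4, 4, 1))
feats[:, 2:, 0] = 1.0
tokens, assign = merge(feats)
restored = unmerge(tokens, assign)
print(tokens.shape[0], "tokens instead of", feats.shape[0] * feats.shape[1])
```

In this toy case the two cells touching the edge are left untouched and the two flat cells collapse to one token each, so the grid is restored at full resolution after `unmerge`. In a real encoder the attention layers would run on `tokens` between the two calls.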

Related papers