Fugu-MT Paper Translation (Abstract): UniSurgSAM: A Unified Promptable Model for Reliable Surgical Video Segmentation

Paper overview: UniSurgSAM: A Unified Promptable Model for Reliable Surgical Video Segmentation

arxiv url: http://arxiv.org/abs/2604.03645v1
Date: Sat, 04 Apr 2026 08:44:10 GMT
Status: translation complete
Last updated in system: 2026-04-07 15:49:18.695966
Title: UniSurgSAM: A Unified Promptable Model for Reliable Surgical Video Segmentation
Title (reference translation): UniSurgSAM: A Unified Promptable Model for Reliable Surgical Video Segmentation
Authors: Haofeng Liu, Ziyue Wang, Alex Y. W. Kong, Guanyi Qin, Yunqiu Xu, Chang Han Low, Mingqi Gao, Lap Yan Lennon Chan, Yueming Jin
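Abstract summary: We present UniSurgSAM, a unified PVOS model that enables reliable surgical video segmentation through visual, textual, or audio prompts. We introduce three key designs: presence-aware decoding to suppress hallucinations, boundary-aware long-term tracking to prevent mask drift over extended sequences, and adaptive state transition that closes the loop between stages for failure recovery. UniSurgSAM achieves state-of-the-art performance in real time across all prompt modalities and granularities, providing a practical foundation for computer-assisted surgery.
Reference score (attention score computed by this site): 18.74680721916099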
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Surgical video segmentation is fundamental to computer-assisted surgery. In practice, surgeons need to dynamically specify targets throughout extended procedures, using heterogeneous cues such as visual selections, textual expressions, or audio instructions. However, existing Promptable Video Object Segmentation (PVOS) methods are typically restricted to a single prompt modality and rely on coupled frameworks that cause optimization interference between target initialization and tracking. Moreover, these methods produce hallucinated predictions when the target is absent and suffer from accumulated mask drift without failure recovery. To address these challenges, we present UniSurgSAM, a unified PVOS model enabling reliable surgical video segmentation through visual, textual, or audio prompts. Specifically, UniSurgSAM employs a decoupled two-stage framework that independently optimizes initialization and tracking to resolve the optimization interference. Within this framework, we introduce three key designs for reliability: presence-aware decoding that models target absence to suppress hallucinations; boundary-aware long-term tracking that prevents mask drift over extended sequences; and adaptive state transition that closes the loop between stages for failure recovery. Furthermore, we establish a multi-modal and multi-granular benchmark from four public surgical datasets with precise instance-level masklets. Extensive experiments demonstrate that UniSurgSAM achieves state-of-the-art performance in real time across all prompt modalities and granularities, providing a practical foundation for computer-assisted surgery. Code and datasets will be available at https://jinlab-imvr.github.io/UniSurgSAM.
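Abstract (reference translation): Surgical video segmentation is fundamental to computer-assisted surgery. In practice, surgeons need to specify targets dynamically throughout lengthy procedures, using heterogeneous cues such as visual selections, textual expressions, and audio instructions. However, existing Promptable Video Object Segmentation (PVOS) methods are usually limited to a single prompt modality and rely on coupled frameworks that cause optimization interference between target initialization and tracking. In addition, these methods produce hallucinated predictions when the target is absent and suffer from accumulated mask drift with no means of failure recovery. To address these challenges, we propose UniSurgSAM, a unified PVOS model that enables reliable surgical video segmentation through visual, textual, or audio prompts. Specifically, UniSurgSAM uses a decoupled two-stage framework that optimizes initialization and tracking independently to resolve the optimization interference. Within this framework, we introduce three key designs for reliability: presence-aware decoding, which models target absence to suppress hallucinations; boundary-aware long-term tracking, which prevents mask drift over extended sequences; and adaptive state transition, which closes the loop between stages for failure recovery. We also build a multi-modal, multi-granular benchmark from four public surgical datasets with precise instance-level masklets. Extensive experiments show that UniSurgSAM achieves state-of-the-art performance in real time across all prompt modalities and granularities, providing a practical foundation for computer-assisted surgery. Code and datasets will be available at https://jinlab-imvr.github.io/UniSurgSAM.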
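
Illustrative sketch (not from the paper): the abstract describes a decoupled two-stage design in which presence-aware decoding suppresses output when the target is absent and adaptive state transition hands control back to initialization for failure recovery. One way to picture that control flow is a small state machine. Everything below is an assumption made for illustration only: the names `State`, `Prediction`, `step`, and `run`, the two scores, and the 0.5 thresholds are hypothetical, and the abstract does not give the actual decoding or transition rules.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class State(Enum):
    INIT = "init"    # stage 1: locate the target from the prompt
    TRACK = "track"  # stage 2: propagate the mask from frame to frame


@dataclass
class Prediction:
    presence: float  # hypothetical score in [0, 1]: is the target visible?
    quality: float   # hypothetical mask-confidence score in [0, 1]


def step(state: State, pred: Prediction,
         presence_thr: float = 0.5,
         quality_thr: float = 0.5) -> Tuple[State, bool]:
    """Return (next_state, emit_mask) for one frame (toy rules, assumed)."""
    if pred.presence < presence_thr:
        # Target judged absent: emit no mask rather than a hallucinated one,
        # and wait in the initialization stage for it to reappear.
        return State.INIT, False
    if state is State.TRACK and pred.quality < quality_thr:
        # Tracking looks unreliable: keep this frame's mask but hand control
        # back to initialization so the next frame re-localizes the target.
        return State.INIT, True
    # Target present and, if tracking, the mask is trusted: keep tracking.
    return State.TRACK, True


def run(preds: List[Prediction]) -> List[Tuple[str, bool]]:
    """Apply step() over a sequence, recording (next_state, emit_mask)."""
    state = State.INIT
    trace = []
    for pred in preds:
        state, emit = step(state, pred)
        trace.append((state.value, emit))
    return trace
```

Calling `run` on a list of per-frame `Prediction` objects returns, for each frame, the stage that handles the next frame and whether a mask was emitted. In this toy version a low presence score always suppresses the mask, and a low quality score during tracking triggers re-initialization.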

Related papers