Fugu-MT Paper Translation (Abstract): SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding

Paper overview: SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding

arxiv url: http://arxiv.org/abs/2509.00357v1
Date: Sat, 30 Aug 2025 04:36:41 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.19437
Title: SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding
Title (reference translation): SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding
Authors: Zhen Chen, Xingjian Luo, Kun Yuan, Jinlin Wu, Danny T. M. Chan, Nassir Navab, Hongbin Liu, Zhen Lei, Jiebo Luo
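Abstract summary: The SurgLLM framework is a large multimodal model tailored for versatile surgical video understanding tasks. To strengthen the spatial focus on surgical videos, the authors first devise Surgical Context-aware Multimodal Pretraining (Surg-Pretrain) for the video encoder of SurgLLM. To incorporate surgical temporal knowledge into SurgLLM, they further propose Temporal-aware Multimodal Tuning (TM-Tuning), which improves temporal reasoning with interleaved multimodal embeddings.
Reference score (independently computed attention metric): 75.00667948967848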
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Surgical video understanding is crucial for facilitating Computer-Assisted Surgery (CAS) systems. Despite significant progress in existing studies, two major limitations persist, including inadequate visual content perception and insufficient temporal awareness in surgical videos, and hinder the development of versatile CAS solutions. In this work, we propose the SurgLLM framework, an effective large multimodal model tailored for versatile surgical video understanding tasks with enhanced spatial focus and temporal awareness. Specifically, to empower the spatial focus of surgical videos, we first devise Surgical Context-aware Multimodal Pretraining (Surg-Pretrain) for the video encoder of SurgLLM, by performing instrument-centric Masked Video Reconstruction (MV-Recon) and subsequent multimodal alignment. To incorporate surgical temporal knowledge into SurgLLM, we further propose Temporal-aware Multimodal Tuning (TM-Tuning) to enhance temporal reasoning with interleaved multimodal embeddings. Moreover, to accommodate various understanding tasks of surgical videos without conflicts, we devise a Surgical Task Dynamic Ensemble to efficiently triage a query with optimal learnable parameters in our SurgLLM. Extensive experiments performed on diverse surgical video understanding tasks, including captioning, general VQA, and temporal VQA, demonstrate significant improvements over the state-of-the-art approaches, validating the effectiveness of our SurgLLM in versatile surgical video understanding. The source code is available at https://github.com/franciszchen/SurgLLM.
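Abstract (reference translation): Surgical video understanding is crucial for facilitating Computer-Assisted Surgery (CAS) systems. Despite significant progress in existing studies, two major limitations persist: inadequate perception of visual content and insufficient temporal awareness in surgical videos, both of which hinder the development of versatile CAS solutions. This work proposes the SurgLLM framework, an effective large multimodal model tailored for versatile surgical video understanding tasks with enhanced spatial focus and temporal awareness. Specifically, to strengthen the spatial focus on surgical videos, the authors devise Surgical Context-aware Multimodal Pretraining (Surg-Pretrain) for the video encoder of SurgLLM, which performs instrument-centric Masked Video Reconstruction (MV-Recon) followed by multimodal alignment. To incorporate surgical temporal knowledge into SurgLLM, they further propose Temporal-aware Multimodal Tuning (TM-Tuning), which improves temporal reasoning with interleaved multimodal embeddings. In addition, to accommodate the various surgical video understanding tasks without conflicts, they devise a Surgical Task Dynamic Ensemble that efficiently triages each query to the optimal learnable parameters in SurgLLM. Extensive experiments on diverse surgical video understanding tasks, including captioning, general VQA, and temporal VQA, show significant improvements over state-of-the-art approaches, validating the effectiveness of SurgLLM for versatile surgical video understanding. The source code is available at https://github.com/franciszchen/SurgLLM.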

Related papers