Fugu-MT Paper Translation (Abstract): SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy

Paper overview: SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy

arxiv url: http://arxiv.org/abs/2603.29962v1
Date: Tue, 31 Mar 2026 16:27:46 GMT
Status: Translation complete
Last updated in system: 2026-04-01 15:25:03.858192
Title: SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy
Title (reference translation): SurgTEMP: Surgical Video Question Answering with Text-Guided Visual Memory in Laparoscopic Cholecystectomy
Authors: Shi Li, Vinkle Srivastav, Nicolas Chanel, Saurav Sharma, Nabani Banik, Lorenzo Arboit, Kun Yuan, Pietro Mascagni, Nicolas Padoy
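Abstract summary: SurgTEMP is a multimodal visual memory framework for surgical video question answering. The authors introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments. In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements.
Reference score (independently computed attention score): 12.553409376394162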
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to navigate evolving intraoperative scenes effectively. Computer-assisted systems such as surgical visual question answering (VQA) offer promise for education and intraoperative support. Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, these components enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and better support diverse downstream assessment tasks. To support model development, we introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments (approximately 128 h total) from laparoscopic cholecystectomy. The dataset is organized into a three-level hierarchy -- Perception, Assessment, and Reasoning -- spanning 11 tasks from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment. In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing the state of video-based surgical VQA.
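Abstract (reference translation): Surgical procedures are inherently complex and risky, and navigating evolving intraoperative scenes well requires extensive expertise and constant focus. Computer-assisted systems such as surgical visual question answering (VQA) hold promise for education and intraoperative support. Current surgical VQA research focuses mainly on static frame analysis and overlooks rich temporal semantics. Surgical video question answering is made harder still by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, the authors propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, these components enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and they better support diverse downstream assessment tasks. The authors also introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments (approximately 128 hours in total) from laparoscopic cholecystectomy. The dataset is organized into a three-level hierarchy (Perception, Assessment, and Reasoning) spanning 11 tasks, from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment. In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing the state of video-based surgical VQA.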
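Illustrative note: the abstract does not specify how the query-guided token selection module works. The following is only a minimal sketch of one common way such selection can be done, assuming that visual tokens and the text query are embedded in the same vector space and that tokens are ranked by cosine similarity to the query. The names `cosine`, `select_tokens`, `query`, and `frame_tokens` are hypothetical and do not come from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_tokens(query, tokens, k):
    """Return indices of the k tokens most similar to the query.

    Indices are returned in their original order so that the temporal
    ordering of the kept tokens is preserved. This is an assumed design,
    not the method described in the paper.
    """
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: cosine(query, tokens[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy example: a 2-D query embedding and four visual token embeddings.
query = [1.0, 0.0]
frame_tokens = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
kept = select_tokens(query, frame_tokens, k=2)
print(kept)
```

In this toy setup the tokens at indices 0 and 2 are kept, because they point most nearly in the query's direction. A real system would operate on learned embeddings and would likely use a tensor library rather than plain lists.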

Related papers