Fugu-MT Paper Translation (Abstract): SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention

Paper overview: SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention

arxiv url: http://arxiv.org/abs/2511.03178v1
Date: Wed, 05 Nov 2025 04:55:11 GMT
Status: Translation complete
Last updated in system: 2025-11-06 18:19:32.326412
Title: SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention
Title (reference translation): SurgAnt-ViVQA: Learning to Predict Surgical Events via GRU-Based Temporal Cross-Attention
Authors: Shreyas C. Dhake, Jiayuan Huang, Runlong He, Danyal Z. Khan, Evangelos B. Mazomenos, Sophia Bano, Hani J. Marcus, Danail Stoyanov, Matthew J. Clarkson, Mobarak I. Hoque
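Abstract summary: Anticipating upcoming surgical events is essential for real-time assistance in endonasal transsphenoidal pituitary surgery. Most visual question answering (VQA) systems reason on isolated frames with static vision-language alignment. We introduce PitVQA-Anticipation, the first VQA dataset designed for forward-looking surgical reasoning.
Reference score (independently computed attention score): 10.149538951173598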
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Anticipating forthcoming surgical events is vital for real-time assistance in endonasal transsphenoidal pituitary surgery, where visibility is limited and workflow changes rapidly. Most visual question answering (VQA) systems reason on isolated frames with static vision-language alignment, providing little support for forecasting next steps or instrument needs. Existing surgical VQA datasets likewise center on the current scene rather than the near future. We introduce PitVQA-Anticipation, the first VQA dataset designed for forward-looking surgical reasoning. It comprises 33.5 hours of operative video and 734,769 question-answer pairs built from temporally grouped clips and expert annotations across four tasks: predicting the future phase, next step, upcoming instrument, and remaining duration. We further propose SurgAnt-ViVQA, a video-language model that adapts a large language model using a GRU Gated Temporal Cross-Attention module. A bidirectional GRU encodes frame-to-frame dynamics, while an adaptive gate injects visual context into the language stream at the token level. Parameter-efficient fine-tuning customizes the language backbone to the surgical domain. SurgAnt-ViVQA is tested on the PitVQA-Anticipation and EndoVis datasets, surpassing strong image- and video-based baselines. Ablations show that temporal recurrence and gated fusion drive most of the gains. A frame budget study indicates a trade-off: 8 frames maximize fluency, whereas 32 frames slightly reduce BLEU but improve numeric time estimation. By pairing a temporally aware encoder with fine-grained gated cross-attention, SurgAnt-ViVQA advances surgical VQA from retrospective description to proactive anticipation. PitVQA-Anticipation offers a comprehensive benchmark for this setting and highlights the importance of targeted temporal modeling for reliable, future-aware surgical assistance.
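Abstract (reference translation): Anticipating surgical events is essential for real-time assistance in endonasal transsphenoidal pituitary surgery, where visibility is limited and the workflow changes rapidly. Most visual question answering (VQA) systems reason on isolated frames with static vision-language alignment and offer little support for predicting next steps or instrument needs. Existing surgical VQA datasets likewise focus on the current scene rather than the near future. We introduce PitVQA-Anticipation, the first VQA dataset designed for forward-looking surgical reasoning. It consists of 33.5 hours of operative video and 734,769 question-answer pairs, built from temporally grouped clips and expert annotations spanning four tasks. We further propose SurgAnt-ViVQA, a video-language model that adapts a large language model using a GRU Gated Temporal Cross-Attention module. A bidirectional GRU encodes frame-to-frame dynamics, and an adaptive gate injects visual context into the language stream at the token level. Parameter-efficient fine-tuning customizes the language backbone to the surgical domain. SurgAnt-ViVQA was tested on the PitVQA-Anticipation and EndoVis datasets and surpassed strong image- and video-based baselines. Ablations show that temporal recurrence and gated fusion drive most of the gains. 8 frames maximize fluency, whereas 32 frames slightly reduce BLEU but improve numeric time estimation. By combining a temporally aware encoder with fine-grained gated cross-attention, SurgAnt-ViVQA advances surgical VQA from retrospective description to proactive anticipation. PitVQA-Anticipation provides a comprehensive benchmark for this setting and highlights the importance of targeted temporal modeling for reliable, future-aware surgical assistance.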
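
The abstract describes a bidirectional GRU that encodes frame-to-frame dynamics and an adaptive gate that injects visual context into the language stream at the token level. The following is a minimal pure-Python sketch of that idea, not the authors' implementation. All names (GRUCell, bigru, gated_cross_attention), the dimensions, the random weights, the use of a scalar sigmoid gate per token, and the omission of biases and query/key/value projections are assumptions made for illustration only.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class GRUCell:
    """Hypothetical minimal GRU cell (biases omitted for brevity)."""
    def __init__(self, in_dim, hid_dim, rng):
        # One input matrix W and one recurrent matrix U per gate:
        # z = update gate, r = reset gate, n = candidate state.
        self.W = {g: rand_matrix(hid_dim, in_dim, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hid_dim, hid_dim, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def bigru(frames, fwd, bwd, hid_dim):
    """Run the frame sequence in both directions and concatenate the states."""
    h, f_states = [0.0] * hid_dim, []
    for x in frames:
        h = fwd.step(x, h)
        f_states.append(h)
    h, b_states = [0.0] * hid_dim, []
    for x in reversed(frames):
        h = bwd.step(x, h)
        b_states.append(h)
    b_states.reverse()
    return [f + b for f, b in zip(f_states, b_states)]  # each state has 2 * hid_dim values

def gated_cross_attention(tokens, states, gate_w):
    """Each token attends over the temporal states; a sigmoid gate scales the injected context."""
    d = len(states[0])
    fused, gates = [], []
    for tok in tokens:
        scores = [sum(q * k for q, k in zip(tok, s)) / math.sqrt(d) for s in states]
        attn = softmax(scores)
        ctx = [sum(a * s[j] for a, s in zip(attn, states)) for j in range(d)]
        g = sigmoid(sum(w * v for w, v in zip(gate_w, tok + ctx)))  # assumed scalar gate per token
        fused.append([t + g * c for t, c in zip(tok, ctx)])  # gated residual injection
        gates.append(g)
    return fused, gates

rng = random.Random(0)
T, feat_dim, hid_dim, n_tokens = 8, 6, 4, 3  # 8 frames, toy feature sizes
frames = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(T)]
tokens = [[rng.uniform(-1, 1) for _ in range(2 * hid_dim)] for _ in range(n_tokens)]
fwd, bwd = GRUCell(feat_dim, hid_dim, rng), GRUCell(feat_dim, hid_dim, rng)
gate_w = [rng.uniform(-0.5, 0.5) for _ in range(4 * hid_dim)]

states = bigru(frames, fwd, bwd, hid_dim)
fused, gates = gated_cross_attention(tokens, states, gate_w)
print(len(states), len(states[0]), len(fused), len(fused[0]))
```

In this sketch a gate value near 0 leaves a token embedding essentially unchanged, while a value near 1 adds the full attended visual context. A real model would learn the weights and would typically use per-dimension gates and learned projections, which are left out here to keep the example short.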

Related papers