Fugu-MT Paper Translation (Abstract): SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding

Paper summary: SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding

arxiv url: http://arxiv.org/abs/2511.03325v1
Date: Wed, 05 Nov 2025 09:40:16 GMT
Status: Translation complete
Last updated in system: 2025-11-06 18:19:32.397636
Title: SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding
Title (reference translation): SurgViVQA: Video Question Answering for Surgical Scene Understanding
Authors: Mauro Orazio Drago, Luca Carlini, Pelinsu Celebi Balyemez, Dennis Pierantozzi, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. Hoque
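Abstract summary: Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by having AI models reason over temporally coherent events. We propose SurgViVQA, a model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video-Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool-tissue interactions.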
Reference score (independently calculated attention level): 11.424693319143715
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by enabling AI models to reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation. We propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video-Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool-tissue interactions, which a fine-tuned large language model (LLM) then decodes into coherent answers. To evaluate its performance, we curated REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related questions and diagnostic attributes, as well as out-of-template questions with rephrased or semantically altered formulations to assess model robustness. Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset shows that SurgViVQA outperforms existing image-based VQA benchmark models, particularly in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions further confirms improved generalizability and robustness to variations in question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework for temporally-aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively. Code and dataset available at https://github.com/madratak/SurgViVQA.
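Abstract (reference translation): Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by enabling AI models to reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, ignoring the dynamics essential for accurate procedural interpretation. We propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video-Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool-tissue interactions. To evaluate its performance, we created REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related questions and diagnostic attributes. Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset shows that SurgViVQA outperforms existing image-based VQA benchmark models, particularly in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions further confirms improved generalizability and robustness to variations in question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework for temporally-aware understanding in surgical VideoQA. Code and dataset are available at https://github.com/madratak/SurgViVQA.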
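
Note on the metric: the abstract reports gains in keyword accuracy but does not define it. As a rough illustration only, the Python sketch below assumes keyword accuracy is the fraction of samples whose generated answer contains the ground-truth keyword as a case-insensitive, whole-token match. The function name `keyword_accuracy` and the sample answers are hypothetical; they are not taken from the SurgViVQA code or from either dataset, and the paper's actual definition may differ.

```python
import re


def _tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_accuracy(predictions, keywords):
    """Fraction of predictions that contain their ground-truth keyword.

    Assumed definition (not confirmed by the abstract): a prediction counts
    as a hit when the keyword's tokens appear in it as a contiguous run.
    """
    if len(predictions) != len(keywords):
        raise ValueError("predictions and keywords must have the same length")
    if not predictions:
        return 0.0

    hits = 0
    for pred, kw in zip(predictions, keywords):
        pred_tokens = _tokens(pred)
        kw_tokens = _tokens(kw)
        n = len(kw_tokens)
        # Slide a window of size n over the prediction tokens.
        if n and any(
            pred_tokens[i:i + n] == kw_tokens
            for i in range(len(pred_tokens) - n + 1)
        ):
            hits += 1
    return hits / len(predictions)


# Hypothetical generated answers and ground-truth keywords.
example_predictions = [
    "The forceps is moving to the left.",
    "No polyp is visible.",
    "The scope is advancing.",
]
example_keywords = ["left", "polyp", "withdrawing"]
example_score = keyword_accuracy(example_predictions, example_keywords)
```

In this made-up example, two of the three answers contain their keyword, so `example_score` is 2/3. Matching on whole tokens rather than substrings keeps a keyword such as "left" from matching inside an unrelated longer word.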

Related papers