Fugu-MT Paper Translation (Abstract): SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?

Paper overview: SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?

arxiv url: http://arxiv.org/abs/2605.01911v2
Date: Tue, 05 May 2026 13:32:54 GMT
Status: Translation complete
Last updated in system: 2026-05-06 14:45:21.2513
Title: SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?
Authors: Jongmin Shin, Ka Young Kim, Eunki Cho, Seong Tae Kim, Namkee Oh,
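Abstract summary: Vision-language models (VLMs) show promising performance in surgical visual question answering (VQA). It remains unclear whether reported performance reflects visual understanding or reliance on linguistic shortcuts in the question phrasing. This paper introduces SurgCheck, a diagnostic benchmark for quantifying linguistic shortcut reliance in surgical VQA.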
Reference score (independently computed attention score): 4.831595834944456
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Purpose: Vision-language models (VLMs) have shown promising performance in surgical visual question answering (VQA). However, existing surgical VQA datasets often contain linguistic shortcuts, where question phrasing implicitly constrains the answer space. It remains unclear whether reported performance reflects visual understanding or reliance on such linguistic shortcuts. Methods: We introduce SurgCheck, a diagnostic benchmark for quantifying linguistic shortcut reliance in surgical VQA. SurgCheck employs a paired-question design in which each surgical frame is associated with an original question containing entity names and a less-biased counterpart that removes these names while preserving identical visual content and ground-truth answers. The resulting performance gap provides a diagnostic signal of shortcut reliance. To ensure that the less-biased question remains well-defined even without entity names, four grounding cues are incorporated: bounding box, arrow, spatial position, and periphrasis. We evaluate both general-purpose and surgical-specific VLMs under zero-shot and fine-tuned settings on SurgCheck. To evaluate open-ended zero-shot responses, we introduce an LLM-as-a-judge evaluation protocol. Results: Using SurgCheck, we observe consistent performance degradation on less-biased questions across five VLMs, despite identical visual inputs. Text-only ablation reveals minimal performance drops for action and target prediction, indicating that action and target prediction is largely driven by linguistic shortcuts rather than visual reasoning. Conclusion: SurgCheck provides a controlled diagnostic framework that exposes failure modes masked by linguistic bias in existing surgical VQA benchmarks. Our findings demonstrate that strong benchmark performance does not necessarily imply faithful visual understanding, underscoring the need for bias-aware evaluation in surgical VQA.
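The paired-question idea in the abstract can be illustrated with a minimal sketch. It assumes the diagnostic signal is a simple accuracy difference between original and less-biased questions; the abstract does not specify the exact metric or data format, so `PairedResult`, `accuracy`, `shortcut_gap`, and the sample values below are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PairedResult:
    """One surgical frame evaluated with both question variants (hypothetical format)."""
    frame_id: str
    original_correct: bool      # answer to the original question (with entity names) was correct
    less_biased_correct: bool   # answer to the less-biased counterpart was correct


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of correct answers; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def shortcut_gap(results: List[PairedResult]) -> float:
    """Accuracy on original questions minus accuracy on less-biased questions.

    Assumption: a larger positive gap suggests heavier reliance on
    linguistic shortcuts, since the visual input is identical in both cases.
    """
    original = accuracy(r.original_correct for r in results)
    less_biased = accuracy(r.less_biased_correct for r in results)
    return original - less_biased


# Made-up illustrative data, not taken from the paper.
results = [
    PairedResult("f1", True, True),
    PairedResult("f2", True, False),
    PairedResult("f3", True, False),
    PairedResult("f4", False, False),
]
gap = shortcut_gap(results)
print(f"shortcut gap: {gap:.2f}")
```

Because both questions in a pair share the same frame and ground-truth answer, any gap in this sketch can only come from the change in question wording, which is the property the benchmark design relies on.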
