Fugu-MT Paper Translation (Abstract): MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI

Paper overview: MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI

arxiv url: http://arxiv.org/abs/2603.19993v1
Date: Fri, 20 Mar 2026 14:43:53 GMT
Status: Translation complete
System update date: 2026-03-23 19:48:39.183334
Title: MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI
Title (reference translation): MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI
Authors: Rozain Shakeel, Abdul Rahman Mohammad Ali, Muneeb Mushtaq, Tausifa Jan Saleem, Tajamul Ashraf
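Abstract summary: MedSPOT is a workflow-aware sequential grounding benchmark for clinical GUI environments. It comprises 216 task-driven videos with 597 annotated keyframes, and each task consists of 2 to 3 interdependent grounding steps. It also introduces a comprehensive failure taxonomy covering edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion.
Reference score (independently computed attention score): 0.7552557021953206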
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across interdependent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, we propose a strict sequential evaluation protocol that terminates task assessment upon the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. We further introduce a comprehensive failure taxonomy, including edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings. By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at: https://github.com/Tajamul21/MedSPOT.
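Abstract (reference translation): Despite rapid progress in MLLMs (Multimodal Large Language Models), their ability to perform reliable visual grounding in high-stakes clinical software environments is still underexplored. Existing GUI benchmarks concentrate on isolated, single-step grounding queries and overlook the sequential, workflow-driven reasoning needed in real-world medical interfaces, where tasks evolve across interdependent steps and dynamic interface states. MedSPOT is a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike earlier benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark consists of 216 task-driven videos with 597 annotated keyframes, and each task is made up of 2 to 3 interdependent grounding steps within a realistic medical workflow. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, the authors propose a strict sequential evaluation protocol that ends the assessment of a task at the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. They also introduce a comprehensive failure taxonomy, including edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings. By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic, safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at https://github.com/Tajamul21/MedSPOT.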
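
The strict sequential evaluation protocol described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Step`, `is_hit`, `evaluate_task`, and `task_completed` are hypothetical, and the sketch assumes that a prediction is a single click point and that a step counts as correct when the point falls inside the ground-truth bounding box. The abstract does not state the actual correctness criterion or metrics.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


@dataclass
class Step:
    """One grounding step of a task (hypothetical structure)."""
    box: Box  # ground-truth target region on the keyframe


def is_hit(pred: Optional[Point], box: Box) -> bool:
    """Assumed criterion: the predicted point lies inside the target box.
    A missing prediction (None) always counts as incorrect."""
    if pred is None:
        return False
    x, y = pred
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def evaluate_task(steps: List[Step], preds: List[Optional[Point]]) -> int:
    """Count consecutive correct steps, stopping at the first incorrect one.
    Later steps are never scored, which is the early-termination rule."""
    completed = 0
    for step, pred in zip(steps, preds):
        if not is_hit(pred, step.box):
            break
        completed += 1
    return completed


def task_completed(steps: List[Step], preds: List[Optional[Point]]) -> bool:
    """A task succeeds only if every one of its steps is correct."""
    return evaluate_task(steps, preds) == len(steps)


# Illustrative data only: a three-step task where the second prediction misses.
steps = [Step((0, 0, 10, 10)), Step((20, 20, 30, 30)), Step((40, 40, 50, 50))]
preds = [(5, 5), (100, 100), (45, 45)]
print(evaluate_task(steps, preds), task_completed(steps, preds))
```

In this example the third prediction lands inside its target, but it is not credited because evaluation stops at the second step. That is how a strict sequential protocol exposes error propagation that per-step accuracy would hide.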
