Fugu-MT Paper Translation (Abstract): GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

Paper summary: GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

arxiv url: http://arxiv.org/abs/2510.13734v1
Date: Wed, 15 Oct 2025 16:40:28 GMT
Status: Translation complete
Last updated in system: 2025-10-16 20:13:28.765294
Title: GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians
Title (reference translation): GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians
Authors: Xiuyuan Chen, Tao Sun, Dexin Su, Ailing Yu, Junwei Liu, Zhe Chen, Gangzeng Jin, Xin Wang, Jingnan Liu, Hansong Xiao, Hualei Zhou, Dongjie Tao, Chunxiao Guo, Minghui Yang, Yuan Xia, Jing Zhao, Qianrui Fan, Yanyun Wang, Shuai Zhen, Kezhong Chen, Jun Wang, Zewen Sun, Heng Zhao, Tian Guan, Shaodong Wang, Geyun Chang, Jiaming Deng, Hongchengcheng Chen, Kexin Feng, Ruzhen Li, Jiayi Geng, Changtai Zhao, Jun Wang, Guihu Lin, Peihao Li, Liqi Liu, Peng Wei, Jian Wang, Jinjie Gu, Ping Wang, Fan Yang
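Abstract summary: Current benchmarks for AI clinician systems fail to capture the depth, robustness, and safety required for real-world clinical practice. This paper proposes the GAPS framework, which evaluates Grounding (cognitive depth), Adequacy (answer completeness), Perturbation (robustness), and Safety. The authors developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end.
Reference score (independently computed attention metric): 32.33432636089606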
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and safety required for real-world clinical practice. To address this, we introduce the GAPS framework, a multidimensional paradigm for evaluating Grounding (cognitive depth), Adequacy (answer completeness), Perturbation (robustness), and Safety. Critically, we developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. Our pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation confirmed our automated questions are high-quality and align with clinician judgment. Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance degrades sharply with increased reasoning depth (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) as well as certain safety issues (S-axis). This automated, clinically grounded approach provides a reproducible and scalable method for rigorously evaluating AI clinician systems and guiding their development toward safer, more reliable clinical practice.
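Abstract (reference translation): Current benchmarks for AI clinician systems are often based on multiple-choice exams or manual rubrics and do not capture the depth, robustness, and safety that real-world clinical practice requires. To address this, the paper introduces the GAPS framework, a multidimensional paradigm for evaluating Grounding (cognitive depth), Adequacy (answer completeness), Perturbation (robustness), and Safety. Importantly, the authors developed a fully automated, guideline-anchored pipeline that builds a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. The pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation showed that the automatically generated questions are high-quality and agree with clinician judgment. Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance drops sharply as reasoning depth increases (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) and to certain safety issues (S-axis). This automated, clinically grounded approach offers a reproducible and scalable way to rigorously evaluate AI clinician systems and to guide their development toward safer, more reliable clinical practice.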
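
The abstract says that scoring is performed by an ensemble of LLM judges but does not state how their verdicts are combined. The following is a minimal sketch under stated assumptions, not the authors' method: each judge returns a boolean verdict per rubric item, items are decided by strict majority vote, and the final score is the fraction of rubric items judged satisfied. The function name `aggregate_rubric_scores` and the rubric item names are hypothetical.

```python
from statistics import mean


def aggregate_rubric_scores(judge_votes):
    """Combine per-judge boolean verdicts into per-item decisions and one score.

    judge_votes maps a rubric item name to a list of bool verdicts, one per judge.
    Majority voting and the unweighted average are assumptions made for this
    sketch; the source does not specify the aggregation rule.
    """
    if not judge_votes:
        raise ValueError("no rubric items to score")
    per_item = {}
    for item, votes in judge_votes.items():
        if not votes:
            raise ValueError(f"no judge votes for rubric item {item!r}")
        # Strict majority: more than half of the judges must mark the item satisfied.
        per_item[item] = sum(votes) * 2 > len(votes)
    # Fraction of rubric items that the ensemble judged satisfied.
    score = mean(1.0 if ok else 0.0 for ok in per_item.values())
    return per_item, score


# Hypothetical verdicts from three judges on four rubric items.
votes = {
    "mentions_contraindication": [True, True, False],
    "recommends_first_line_therapy": [True, False, False],
    "cites_guideline_dose": [True, True, True],
    "flags_red_flag_symptoms": [False, False, False],
}
per_item, score = aggregate_rubric_scores(votes)
print(per_item, score)
```

A strict majority avoids ties counting as satisfied when the number of judges is even; a weighted or consensus-based rule would be an equally plausible choice.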

Related papers