Fugu-MT Paper Translation (Abstract): GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

Paper overview: GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

arxiv url: http://arxiv.org/abs/2605.24636v1
Date: Sat, 23 May 2026 15:53:44 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:18.287844
Title: GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration
Title (reference translation): GlobalDentBench: A multinational benchmark for evaluating LLM clinical reasoning in dentistry, with expert calibration
Authors: Junjie Zhao, Jingyi Liang, Zhenyang Cai, Jiaming Zhang, Zhenwei Wen, Shuzhi Deng, Wenjing Yi, Chunfeng Luo, Hexian Zhang, Junying Chen, Tianrui Liu, Zhuhui Bai, Zixu Zhang, Pradeep Singh, Xiang Liu, Jianquan Li, Nhan L Tran, Falk Schwendicke, Zuolin Jin, Lijian Jin, Liangyi Chen, Wei-fa Yang, Benyou Wang, Junwen Wang, Shan Jiang
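Abstract summary: Large language models (LLMs) hold transformative potential for medicine, but the robustness and safety of their reasoning in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that covers 14 dental specialties across 88 countries and regions on six continents. The benchmark comprises 8,978 expert-validated questions across three formats (multiple-choice, short-answer, and case-based questions) and assesses three progressive reasoning levels.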
Reference score (independently computed attention score): 35.3851755931076
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While large language models (LLMs) hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that encompasses 14 dental specialties across 88 countries and regions spanning six continents. The benchmark comprises 8,978 expert-validated questions across three formats (multiple-choice, short-answer, and case-based questions) and assesses three progressive reasoning levels: knowledge recall (L1), routine reasoning (L2), and individualized reasoning (L3). To ensure data quality, the automated construction framework was calibrated by six senior dentists, achieving expert agreement rates of 99.98% for multiple-choice and short-answer questions and 96.78% for the more complex case-based questions. Evaluation of 12 frontier LLMs on GlobalDentBench revealed a sharp, stepwise performance degradation with increasing reasoning complexity. Specifically, accuracy plummeted from 81.34% on multiple-choice to 64.53% on short-answer and 22.34% on case-based questions, while declining markedly from 74.01% at L1 to 55.64% at L2 and 35.71% at L3. More critically, risk analysis of real-world dental cases demonstrated an alarming overall unsafe rate of 31.01% in LLM-generated clinical recommendations, with 4.51% posing risks of irreversible patient harm and risks particularly pronounced in specialties such as orthodontics. These findings expose fundamental limitations in the medical reasoning and safety of current LLMs. Consequently, GlobalDentBench provides a scalable foundation for trustworthy clinical AI evaluation, underscoring the urgent need for rigorous validation before the safe deployment of these models in healthcare.
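Abstract (reference translation): Large language models (LLMs) hold transformative potential for medicine, but the robustness and safety of their reasoning in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that covers 14 dental specialties across 88 countries and regions on six continents. The benchmark consists of 8,978 expert-validated questions across three formats (multiple-choice, short-answer, and case-based questions) and assesses three progressive reasoning levels: knowledge recall (L1), routine reasoning (L2), and individualized reasoning (L3). To ensure data quality, six senior dentists calibrated the automated construction framework, reaching expert agreement rates of 99.98% for multiple-choice and short-answer questions and 96.78% for the more complex case-based questions. Evaluating 12 frontier LLMs on GlobalDentBench showed a sharp, stepwise degradation in performance as reasoning complexity increased. Specifically, accuracy fell from 81.34% on multiple-choice to 64.53% on short-answer and 22.34% on case-based questions, and declined markedly from 74.01% at L1 to 55.64% at L2 and 35.71% at L3. More critically, risk analysis of real-world dental cases showed an alarming overall unsafe rate of 31.01% in LLM-generated clinical recommendations, with 4.51% posing risks of irreversible patient harm; the risks were particularly pronounced in specialties such as orthodontics. These findings expose fundamental limitations in the medical reasoning and safety of current LLMs. GlobalDentBench therefore provides a scalable foundation for trustworthy clinical AI evaluation and underscores the urgent need for rigorous validation before these models are deployed in healthcare.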
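
The stepwise degradation reported in the abstract can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the paper: it takes only the accuracy figures quoted in the abstract and computes the percentage-point drop between consecutive question formats and reasoning levels. The helper name `stepwise_drops` and the dictionary layout are illustrative assumptions.

```python
# Accuracies (%) exactly as quoted in the abstract.
by_format = {"multiple-choice": 81.34, "short-answer": 64.53, "case-based": 22.34}
by_level = {"L1": 74.01, "L2": 55.64, "L3": 35.71}


def stepwise_drops(scores):
    """Return (from, to, percentage-point drop) for each consecutive pair.

    Hypothetical helper for illustration; relies on dict insertion order.
    """
    items = list(scores.items())
    return [(a, b, round(x - y, 2)) for (a, x), (b, y) in zip(items, items[1:])]


for name, scores in (("format", by_format), ("level", by_level)):
    for src, dst, drop in stepwise_drops(scores):
        print(f"{name}: {src} -> {dst}: -{drop:.2f} points")
```

Under these figures, the largest single drop is between short-answer and case-based questions, which is consistent with the abstract's description of case-based questions as the most complex format.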

Related papers