Fugu-MT Paper Translation (Abstract): EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

Paper overview: EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

arxiv url: http://arxiv.org/abs/2605.30637v1
Date: Thu, 28 May 2026 22:38:26 GMT
Status: Translation complete
Last updated in system: 2026-06-01 20:56:50.27476
Title: EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs
Title (reference translation): EHRBench: An Automated and Reliable EHR-Based Benchmark for Clinical Decision Making with LLMs
Authors: Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui, Guanchen Wu, Ziyang Zhang, Kai Shu, Jiaying Lu, Xiao Hu, Carl Yang
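Abstract summary: Clinical decision-making (CDM) is central to real-world clinical practice, where clinicians infer diagnoses, select treatments, and anticipate future health outcomes under incomplete evidence. LLMs are increasingly used to support these decisions because of their strong language capabilities, broad biomedical knowledge, and efficiency. The reliability of LLMs on real-world clinical decision tasks is still insufficiently understood.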
Reference score (independently computed attention score): 51.129595320595094
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Clinical decision-making (CDM) is central to real-world clinical workflows, where clinicians infer diagnoses, select treatments, or anticipate future health outcomes under incomplete evidence. LLMs are increasingly used to support these decisions due to strong language capabilities, broad biomedical knowledge, and efficiency, yet the reliability of LLMs on real-world clinical decision tasks remains insufficiently understood. To evaluate CDM models, especially LLM-based models, an ideal and practical medical decision benchmark should be constructed via an automated yet reliable pipeline to ensure both scale and quality. Moreover, grounding a CDM benchmark in real patient EHRs can better support evaluation on practical CDM tasks that require substantive biomedical knowledge and clinical inference. To fill these gaps, we introduce EHRBench, an automated and reliable EHR-grounded benchmark for evaluating LLM-based clinical decision-making at scale. To ensure scalability and reliability, EHRBench is constructed through an EHR-LLM-KB (knowledge base) interaction pipeline. For efficiency, we use a specialized LLM to automatically convert encounter-level EHR trajectories into structured templates and deterministically instantiate the templates into QA items. In parallel, we apply systematic KB-based verification and enrichment to filter hallucinated or ambiguous relations and to improve reliability. Using this pipeline, we construct nearly 1M (960,067) QA items spanning three core inference-required clinical decision tasks: diagnosis, treatment, and prognosis. We benchmark more than 30 representative LLMs on EHRBench and provide detailed analyses of performance and robustness. The results show consistent capability trends across settings, further validating the reliability of EHRBench and highlighting actionable gaps toward clinically reliable LLM systems.
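Abstract (reference translation): Clinical decision-making (CDM) is central to real-world clinical workflows, in which clinicians infer diagnoses, select treatments, or anticipate future health outcomes under incomplete evidence. LLMs are increasingly used to support these decisions because of their language capabilities, broad biomedical knowledge, and efficiency, but the reliability of LLMs on real-world clinical decision tasks is still not sufficiently understood. To evaluate CDM models, and LLM-based models in particular, an ideal and practical medical decision benchmark needs to be built through an automated yet reliable pipeline so that both scale and quality are ensured. In addition, grounding a CDM benchmark in real patient EHRs better supports the evaluation of practical CDM tasks that require substantive biomedical knowledge and clinical inference. To fill these gaps, we introduce EHRBench, an automated and reliable EHR-grounded benchmark for evaluating LLM-based clinical decision-making at scale. To ensure scalability and reliability, EHRBench is built through an EHR-LLM-KB (knowledge base) interaction pipeline. For efficiency, a specialized LLM automatically converts encounter-level EHR trajectories into structured templates, and the templates are deterministically instantiated into QA items. In parallel, systematic KB-based verification and enrichment are applied to filter out hallucinated or ambiguous relations and to improve reliability. Using this pipeline, we construct nearly 1M (960,067) QA items spanning three core clinical decision tasks that require inference: diagnosis, treatment, and prognosis. We benchmark more than 30 representative LLMs on EHRBench and analyze their performance and robustness in detail. The results show consistent capability trends across settings, which further validates the reliability of EHRBench and highlights actionable gaps on the way to clinically reliable LLM systems.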
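
Illustrative sketch (not from the paper): the abstract describes two steps that can be pictured in code, namely deterministic instantiation of structured templates into QA items and KB-based filtering of hallucinated or ambiguous relations. The Python below is a minimal sketch under stated assumptions. The `Template` fields, the triple-set knowledge base `TOY_KB`, and the functions `kb_status`, `instantiate`, and `build_items` are all hypothetical names chosen for illustration; the paper's actual template schema, knowledge base, and filtering rules are not given in the abstract. The toy triples are placeholders, not clinical guidance.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical structured template, as an LLM might emit per encounter."""
    task: str                     # "diagnosis", "treatment", or "prognosis"
    context: str                  # short summary of the encounter evidence
    head: str                     # entity the question is about
    relation: str                 # relation linking head and answer
    answer: str                   # answer proposed by the template-writing LLM
    distractors: tuple[str, ...]  # incorrect options


# Toy knowledge base of (head, relation, tail) triples. Illustrative only.
TOY_KB = {
    ("type 2 diabetes", "treated_with", "metformin"),
    ("type 2 diabetes", "treated_with", "insulin"),
}


def kb_status(t: Template, kb: set) -> str:
    """Classify a template as 'ok', 'hallucinated', or 'ambiguous'."""
    if (t.head, t.relation, t.answer) not in kb:
        return "hallucinated"   # the KB does not support the proposed answer
    if any((t.head, t.relation, d) in kb for d in t.distractors):
        return "ambiguous"      # a distractor is also a KB-supported answer
    return "ok"


def instantiate(t: Template) -> dict:
    """Deterministically turn a template into a multiple-choice QA item."""
    options = sorted((t.answer, *t.distractors))  # fixed order, no randomness
    letters = "ABCDEFGH"
    lines = [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    question = (
        f"[{t.task}] {t.context}\n"
        f"Which option is most appropriate for {t.head}?\n" + "\n".join(lines)
    )
    return {"question": question, "answer": letters[options.index(t.answer)]}


def build_items(templates: list, kb: set) -> list:
    """Keep only KB-verified templates and instantiate them."""
    return [instantiate(t) for t in templates if kb_status(t, kb) == "ok"]


templates = [
    Template("treatment", "Adult with elevated HbA1c.", "type 2 diabetes",
             "treated_with", "metformin", ("amoxicillin", "warfarin")),
    Template("treatment", "Adult with elevated HbA1c.", "type 2 diabetes",
             "treated_with", "warfarin", ("amoxicillin",)),   # unsupported
    Template("treatment", "Adult with elevated HbA1c.", "type 2 diabetes",
             "treated_with", "metformin", ("insulin",)),      # ambiguous
]
items = build_items(templates, TOY_KB)
```

In this sketch only the first template survives filtering. Sorting the options instead of shuffling them is one simple way to make instantiation reproducible; a seeded shuffle would serve the same purpose.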

Related papers