Fugu-MT Paper Translation (Abstract): A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

Paper summary: A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

arxiv url: http://arxiv.org/abs/2603.25196v1
Date: Thu, 26 Mar 2026 09:00:55 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:48.20082
Title: A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations
Title (reference translation): A Decade-Scale Benchmark for Evaluating LLMs' Clinical Practice Guideline Detection and Adherence in Multi-turn Conversations
Authors: Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu, Xi Wang, Yingcong Chen, Can Yang, Shujie Liu, Hao Chen
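Abstract summary: Large language models (LLMs) are increasingly deployed in healthcare scenarios. It is unclear whether LLMs can identify and adhere to clinical guidelines during conversations. CPGBench is an automated framework that benchmarks the clinical guideline detection and adherence capabilities of LLMs.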
Reference score (independently computed attention score): 60.2076951536797
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents from 9 countries/regions and 2 international organizations, published in the last decade and spanning 24 specialties. From these documents, we extract 32,155 clinical recommendations with the corresponding publication institute, date, country, specialty, recommendation strength, evidence level, etc. One multi-turn conversation is generated for each recommendation to evaluate the detection and adherence capabilities of 8 leading LLMs. We find that 71.1%-89.6% of recommendations can be correctly detected, while only 3.6%-29.7% of the corresponding titles can be correctly referenced, revealing the gap between knowing the guideline contents and knowing where they come from. The adherence rates range from 21.8% to 63.2% across models, indicating a large gap between knowing the guidelines and being able to apply them. To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties. To our knowledge, CPGBench is the first benchmark systematically revealing which clinical recommendations LLMs fail to detect or adhere to during conversations. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety-critical, addressing these gaps is crucial for the safe and responsible deployment of LLMs in real-world clinical practice.
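Abstract (reference translation): Clinical practice guidelines (CPGs) play an important role in ensuring evidence-based decision-making and improving patient outcomes. Large language models (LLMs) are increasingly being deployed in healthcare scenarios, but it is unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework that benchmarks the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents from 9 countries/regions and 2 international organizations, published over the past decade and spanning 24 specialties. From these documents, we extract 32,155 clinical recommendations together with the corresponding publishing institute, date, country, specialty, recommendation strength, evidence level, and so on. One multi-turn conversation is generated per recommendation to evaluate the detection and adherence capabilities of 8 LLMs. While 71.1%-89.6% of recommendations can be correctly detected, only 3.6%-29.7% of the corresponding titles can be correctly referenced. Adherence rates range from 21.8% to 63.2% across models, indicating a large gap between knowing the guidelines and applying them. To confirm the validity of the automatic analysis, we conduct a comprehensive human evaluation involving 56 clinicians from different specialties. To our knowledge, CPGBench is the first benchmark to systematically reveal which clinical recommendations LLMs fail to detect or adhere to during conversations. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety-critical, addressing these gaps is essential for the safe and responsible deployment of LLMs in real-world clinical practice.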

Related papers