Fugu-MT 論文翻訳(概要): A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

論文の概要: A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

arxiv url: http://arxiv.org/abs/2507.23486v1
Date: Thu, 31 Jul 2025 12:10:00 GMT
ステータス: 翻訳完了
システム内更新日: 2025-08-01 17:19:09.72911
Title: A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains
Title（参考訳）: 医療用LLMの新しい評価基準 : 臨床領域における安全性と有効性
Authors: Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, Yinan Jiang, Zhicheng Huang, Lingyun Ma, Wenjie Shen, Yajie Ji, Yunhui Tan, Chunbo Wang, Yunlu Gao, Qianling Ye, Rui Lin, Mingyu Chen, Lijuan Niu, Zhihao Wang, Peng Yu, Mengran Lang, Yue Liu, Huimin Zhang, Haitao Shen, Long Chen, Qiguang Zhao, Si-Xuan Liu, Lina Zhou, Hua Gao, Dongqiang Ye, Lingmin Meng, Youtao Yu, Naixin Liang, Jianxiong Wu,
Abstract要約: 大言語モデル (LLMs) は臨床決定支援において有望であるが、安全性評価と有効性検証において大きな課題に直面している。臨床専門家のコンセンサスに基づく多次元フレームワークであるCSEDBを開発した。 13名の専門医が, 現実のシナリオをシミュレートする26の臨床部門にまたがって, 2,069件のオープンエンドQ&A項目を作成した。
参考スコア（独自算出の注目度）: 15.73821689524201
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.
Abstract（参考訳）: 大言語モデル (LLMs) は臨床決定支援において有望であるが、安全性評価と有効性検証において大きな課題に直面している。臨床専門家によるコンセンサスに基づく多次元フレームワークであるCSEDB(Citial Safety-Effectiveness Dual-Track Benchmark)を開発した。 13名の専門医が, 現実のシナリオをシミュレートする26の臨床部門にまたがって, 2,069件のオープンエンドQ&A項目を作成した。 6台のLCMのベンチマークテストでは、平均スコア57.2%、安全性54.7%、有効性62.3%)が適度に改善され、ハイリスクシナリオでは13.3%のパフォーマンス低下(p < 0.0001)が見られた。ドメイン固有の医療用LLMは,安全性(0.912)と有効性(0.861)において,汎用モデルよりも一貫した性能上の優位性を示した。本研究は, 医療用LLMの臨床応用評価のための標準化された指標を提供するとともに, 比較分析, リスク暴露識別, 改善方向を異なるシナリオに展開するだけでなく, 医療環境における大規模言語モデルのより安全かつ効果的な展開を促進する可能性も持っている。

関連論文リスト

HIVMedQA: Benchmarking large language models for HIV medical decision support [0.0]
HIV管理は、その複雑さのために魅力的なユースケースである。大規模言語モデル(LLM)を臨床実践に統合すると、正確性、潜在的な害、臨床受理に関する懸念が高まる。本研究は、HIV治療におけるLSMの現在の能力を評価し、その強度と限界を強調した。
論文参考訳（メタデータ） (2025-07-24T07:06:30Z)
Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings [51.73411055162861]
本稿では,患者と臨床医の両方の視点で医療領域に適した安全評価プロトコルを提案する。医療用LLMの安全性評価基準を3つの異なる視点を取り入れたレッドチームで定義した最初の研究である。
論文参考訳（メタデータ） (2025-07-09T19:38:58Z)
MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks [47.486705282473984]
大規模言語モデル(LLM)は、医学試験においてほぼ完璧なスコアを得る。これらの評価は、実際の臨床実践の複雑さと多様性を不十分に反映している。 MedHELMは,医療業務におけるLCMの性能を評価するための評価フレームワークである。
論文参考訳（メタデータ） (2025-05-26T22:55:49Z)
Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
医学的文脈における大きな言語モデル(LLM)の信頼性と精度は依然として重要な懸念点である。現在の評価手法はロバスト性に欠けることが多く、LLMの性能を総合的に評価することができない。我々は,これらの課題に対処するために,医療用LCMの特別設計評価フレームワークであるMed-CoDEを提案する。
論文参考訳（メタデータ） (2025-04-21T16:51:11Z)
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
MedR-Benchは1,453例の構造化患者のベンチマークデータセットで、推論基準を付した注釈付きである。本稿では,3つの批判的診察勧告,診断決定,治療計画を含む枠組みを提案し,患者のケアジャーニー全体をシミュレートする。このベンチマークを用いて、DeepSeek-R1、OpenAI-o3-mini、Gemini-2.0-Flash Thinkingなど、最先端の5つのLCMを評価した。
論文参考訳（メタデータ） (2025-03-06T18:35:39Z)
CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBenchは、14のエキスパートによるコア臨床シナリオを備えた総合的なベンチマークである。このベンチマークの信頼性はいくつかの点で確認されている。
論文参考訳（メタデータ） (2024-10-04T15:15:36Z)
SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy [45.2233252981348]
臨床知識を符号化するための言語モデル(LLM)が示されている。 6つの最先端モデルをベンチマークする評価フレームワークであるSemioLLMを提案する。ほとんどのLSMは、脳内の発作発生領域の確率的予測を正確かつ確実に生成できることを示す。
論文参考訳（メタデータ） (2024-07-03T11:02:12Z)
Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding [31.884600238089405]
臨床研究に適したオープン・大型言語モデル(LLM)であるクリニカル・カメルについて述べる。 QLoRAを用いてLLaMA-2を微調整し,医療用LCMの医療用ベンチマークにおける最先端性能を実現する。
論文参考訳（メタデータ） (2023-05-19T23:07:09Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。