Fugu-MT Paper Translation (Summary): MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Paper summary: MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

arxiv url: http://arxiv.org/abs/2505.23802v2
Date: Mon, 02 Jun 2025 04:19:10 GMT
Status: Translation complete
Last updated in system: 2025-06-03 13:48:30.075339
Title: MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks
Authors: Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M. Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, Hao Qiu, Shrey Jain, Leonardo Schettini, Mehr Kashyap, Jason Alan Fries, Akshay Swaminathan, Philip Chung, Fateme Nateghi, Asad Aali, Ashwin Nayak, Shivam Vedak, Sneha S. Jain, Birju Patel, Oluseyi Fayanju, Shreya Shah, Ethan Goh, Dong-han Yao, Brian Soetikno, Eduardo Reis, Sergios Gatidis, Vasu Divi, Robson Capasso, Rachna Saralkar, Chia-Chun Chiang, Jenelle Jindal, Tho Pham, Faraz Ghoddusi, Steven Lin, Albert S. Chiou, Christy Hong, Mohana Roy, Michael F. Gensheimer, Hinesh Patel, Kevin Schulman, Dev Dash, Danton Char, Lance Downing, Francois Grolleau, Kameron Black, Bethel Mieso, Aydin Zahedivash, Wen-wai Yim, Harshita Sharma, Tony Lee, Hannah Kirsch, Jennifer Lee, Nerissa Ambers, Carlene Lugtu, Aditya Sharma, Bilal Mawji, Alex Alekseyev, Vicky Zhou, Vikas Kakkar, Jarrod Helzer, Anurang Revri, Yair Bannett, Roxana Daneshjou, Jonathan Chen, Emily Alsentzer, Keith Morse, Nirmal Ravi, Nima Aghaeepour, Vanessa Kennedy, Akshay Chaudhari, Thomas Wang, Sanmi Koyejo, Matthew P. Lungren, Eric Horvitz, Percy Liang, Mike Pfeffer, Nigam H. Shah
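Abstract summary: Large language models (LLMs) achieve near-perfect scores on medical licensing exams, but these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. MedHELM is an evaluation framework for assessing LLM performance on medical tasks.
Reference score (attention score computed by this site): 47.486705282473984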
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). Claude 3.5 Sonnet achieved comparable performance to top models at lower estimated cost. These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open-source framework to enable this.
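
The abstract reports model quality as a win-rate (for example, 66% for DeepSeek R1) without defining it on this page. The sketch below shows one common reading, assumed here and not confirmed by the source: the fraction of head-to-head comparisons against other models, per benchmark, that a model wins, with ties counted as half a win. The function name `mean_win_rate`, the model names, the benchmark names, and all scores are hypothetical placeholders, not MedHELM data.

```python
def mean_win_rate(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Compute a pairwise win-rate per model.

    scores[model][benchmark] is a score where higher is better.
    Assumption: win-rate = wins / comparisons over all (other model, benchmark)
    pairs that both models were scored on; a tie counts as half a win.
    """
    win_rates: dict[str, float] = {}
    for model, model_scores in scores.items():
        wins = 0.0
        comparisons = 0
        for other, other_scores in scores.items():
            if other == model:
                continue
            for bench, value in model_scores.items():
                if bench not in other_scores:
                    continue  # skip benchmarks the other model was not run on
                comparisons += 1
                if value > other_scores[bench]:
                    wins += 1.0
                elif value == other_scores[bench]:
                    wins += 0.5
        win_rates[model] = wins / comparisons if comparisons else float("nan")
    return win_rates


# Hypothetical toy scores on a normalized 0-1 scale (not taken from the paper).
toy_scores = {
    "model_a": {"bench_1": 0.80, "bench_2": 0.60},
    "model_b": {"bench_1": 0.70, "bench_2": 0.65},
    "model_c": {"bench_1": 0.75, "bench_2": 0.50},
}

for name, rate in mean_win_rate(toy_scores).items():
    print(f"{name}: {rate:.2f}")
```

Under this reading, a model can have a high win-rate while trailing on individual task categories, which is why the abstract reports per-category normalized accuracy separately.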

Related papers

A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains [15.73821689524201]
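Large language models (LLMs) show promise for clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed CSEDB, a multidimensional framework based on clinical expert consensus. Thirteen specialist physicians created 2,069 open-ended Q&A items spanning 26 clinical departments to simulate real-world scenarios.
Paper reference translation (metadata) (2025-07-31T12:10:00Z)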
Preserving Privacy, Increasing Accessibility, and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation [0.0]
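We developed and evaluated a privacy-preserving, on-device medical transcription system using the Llama 3.2 1B model. Running entirely in the browser, the model can generate structured medical notes from medical transcriptions while maintaining complete data sovereignty.
Paper reference translation (metadata) (2025-07-03T01:51:49Z)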
Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
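The reliability and accuracy of large language models (LLMs) in medical contexts remain a critical concern. Current evaluation methods often lack robustness and fail to assess LLM performance comprehensively. To address these challenges, we propose Med-CoDE, an evaluation framework designed specifically for medical LLMs.
Paper reference translation (metadata) (2025-04-21T16:51:11Z)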
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
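MedR-Bench is a benchmark dataset of 1,453 structured patient cases annotated with reasoning references. We propose a framework covering three critical stages (examination recommendation, diagnostic decision-making, and treatment planning) that simulates the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
Paper reference translation (metadata) (2025-03-06T18:35:39Z)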
Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs) [0.5434005537854512]
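This study examined the potential of large language models (LLMs) to automate OSCE assessment using the Master Interview Rating Scale (MIRS). We compared the performance of four state-of-the-art LLMs in evaluating OSCE transcripts across all 28 MIRS items under zero-shot, chain-of-thought (CoT), few-shot, and multi-step prompting conditions.
Paper reference translation (metadata) (2025-01-21T04:05:45Z)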
Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators [20.782328949004434]
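Large language models (LLMs) have been evaluated for general medical knowledge using licensing exams. This study examines models on 1,009 question-answer pairs covering 35 clinical calculators. Two human annotators nominally outperformed the LLMs, with an average answer accuracy of 79.5%.
Paper reference translation (metadata) (2024-11-08T15:50:19Z)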
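Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness [4.118721833273984]
Large language models (LLMs) show potential for medical applications but often lack specialized clinical knowledge. Retrieval Augmented Generation (RAG) enables customization with domain-specific information, making it well suited to healthcare. This study examined the accuracy, consistency, and safety of RAG models in determining fitness for surgery and providing preoperative instructions.
Paper reference translation (metadata) (2024-10-11T00:34:20Z)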
Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
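MedS-Bench is a benchmark for evaluating the performance of large language models (LLMs) in clinical contexts. It spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendation, diagnosis, named entity recognition, and medical concept explanation. MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 112 tasks.
Paper reference translation (metadata) (2024-08-22T17:01:34Z)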
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
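We introduce AI Hospital, a framework for simulating dynamic medical interactions between a Doctor, acting as the player, and NPCs. This setup enables realistic evaluation of LLMs in clinical scenarios. We developed a multi-view medical evaluation benchmark using high-quality Chinese medical records and NPCs.
Paper reference translation (metadata) (2024-02-15T06:46:48Z)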
MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
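Large language models (LLMs) can follow natural language instructions with human-level fluency. Evaluating LLMs on realistic text generation tasks for healthcare remains challenging. We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
Paper reference translation (metadata) (2023-08-27T12:24:39Z)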
Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding [31.884600238089405]
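We present Clinical Camel, an open large language model (LLM) tailored for clinical research. Fine-tuned from LLaMA-2 using QLoRA, it achieves state-of-the-art performance for medical LLMs on medical benchmarks.
Paper reference translation (metadata) (2023-05-19T23:07:09Z)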

The related papers list is generated automatically from the titles and abstracts of papers on this site.
