Fugu-MT Paper Translation (Abstract): MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

Paper overview: MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

arxiv url: http://arxiv.org/abs/2605.20197v1
Date: Sun, 05 Apr 2026 14:11:00 GMT
Status: Translation complete
Last updated in system: 2026-05-25 12:34:33.961367
Title: MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction
Title (reference translation): MedicalBench: Evaluation of Large Language Models Toward Improving Medical Concept Extraction
Authors: Zhichao Yang, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman,
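Abstract summary: We present MedicalBench, a benchmark for medical concept extraction with evidence grounding. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a large language model (LLM) triage pipeline. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction.
Reference score (independently computed attention measure): 1.1371912210771806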
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts instead of implicit concepts. We present MedicalBench, a benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. We further show that performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.
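Abstract (reference translation): Extracting medical concepts from electronic health records supports many downstream applications, but it remains difficult because medically meaningful concepts are often implied rather than stated explicitly in medical narratives. Existing benchmarks with human-annotated evidence spans highlight the importance of grounding extracted concepts in the medical text. However, they focus mainly on explicit concepts rather than implicit ones. We introduce MedicalBench, a benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over pairs of medical notes and concepts, combined with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline, followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. We define two complementary evaluation tasks, (1) medical concept extraction and (2) sentence-level evidence retrieval, which allow both correctness and interpretability to be assessed. Benchmarking state-of-the-art LLMs shows that performance remains modest, highlighting how hard it is to extract implicitly expressed concepts. We further show that performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.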
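
The abstract describes the benchmark as a verification task over note-concept pairs plus sentence-level evidence identification, without giving a data schema or naming the metrics. The following is only a minimal sketch of how such an item and the two evaluation tasks could be represented. The `NoteConceptPair` fields, the use of accuracy for task (1), and the use of sentence-index F1 for task (2) are all assumptions made for illustration, not the paper's definitions, and the example note is invented.

```python
from dataclasses import dataclass, field


@dataclass
class NoteConceptPair:
    """Hypothetical benchmark item: one note paired with one candidate concept."""
    sentences: list[str]   # the note, already split into sentences
    concept: str           # candidate concept, e.g. an ICD-10 code description
    label: bool            # gold verdict: does the note support the concept?
    evidence: set[int] = field(default_factory=set)  # gold evidence sentence indices


def verification_accuracy(gold: list[bool], pred: list[bool]) -> float:
    """Task 1 (assumed metric): share of pairs whose yes/no verdict matches gold."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def evidence_f1(gold: set[int], pred: set[int]) -> float:
    """Task 2 (assumed metric): F1 over evidence sentence indices for one pair."""
    if not gold and not pred:
        return 1.0  # nothing to retrieve and nothing retrieved
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented toy item: the concept is implied by the note, never named in it.
pair = NoteConceptPair(
    sentences=[
        "Patient was continued on metformin.",
        "HbA1c on admission was 9.1%.",
        "Discharged home in stable condition.",
    ],
    concept="Type 2 diabetes mellitus",
    label=True,
    evidence={0, 1},
)

# A model that answers "yes" but cites sentences 1 and 2 as evidence.
print(verification_accuracy([pair.label], [True]))
print(evidence_f1(pair.evidence, {1, 2}))
```

Scoring the verdict and the evidence separately mirrors the abstract's distinction between correctness and interpretability: a model can answer the verification question correctly while citing the wrong sentences.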

Related papers