Fugu-MT Paper Translation (Abstract): CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

Paper overview: CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

arxiv url: http://arxiv.org/abs/2603.06183v1
Date: Fri, 06 Mar 2026 11:43:42 GMT
Status: translation complete
Last updated in system: 2026-03-09 13:17:45.5831
Title: CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation
Title (reference translation): CRIMSON: LLM-based Metric for Generative Radiology Report Evaluation
Authors: Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad, Hassan AlOmaish, Sung Eun Kim, Oishi Banerjee, Pranav Rajpurkar
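Abstract summary: CRIMSON is a clinically grounded evaluation framework for chest X-ray report generation. It categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists.
Reference score (independently computed attention score): 2.61152955442649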
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's tau = 0.61-0.71; Pearson's r = 0.71-0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass-fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1-5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks, RadJudge and RadPref, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.
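Abstract (reference translation): CRIMSON is a clinically grounded evaluation framework for chest X-ray report generation that evaluates reports on diagnostic correctness, contextual relevance, and patient safety. Unlike earlier metrics, CRIMSON takes in the full clinical context, including patient age, indication, and guideline-based decision rules, and keeps normal or clinically insignificant findings from having a disproportionate effect on the overall score. The framework sorts errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (for example, location, severity, measurement, and diagnostic overinterpretation). Each finding receives a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign) based on a guideline developed together with attending cardiothoracic radiologists, which allows severity-aware weighting that puts clinically consequential mistakes ahead of benign discrepancies. CRIMSON is validated by its strong alignment with the clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's tau = 0.61-0.71; Pearson's r = 0.71-0.84), and by two additional benchmarks. In RadJudge, a targeted suite of clinically challenging pass-fail scenarios, CRIMSON agrees consistently with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1-5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. The authors release the metric, the RadJudge and RadPref evaluation benchmarks, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation.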
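
Illustrative note: the abstract reports validation as rank and linear correlation (Kendall's tau, Pearson's r) between the metric and radiologist-annotated error counts. The sketch below shows how such alignment statistics can be computed in plain Python. It is not the authors' evaluation code. The function names `pearson_r` and `kendall_tau_b`, the choice of the tau-b variant, and the toy numbers are all assumptions made for illustration; none of the values come from ReXVal.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def kendall_tau_b(x, y):
    """Kendall's tau-b (tie-corrected), computed over all pairs in O(n^2)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors
        elif dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    return (concordant - discordant) / denom


# Toy data (invented): a hypothetical per-report penalty from some metric,
# and a hypothetical count of clinically significant errors per report.
metric_penalty = [0.10, 0.40, 0.35, 0.80, 0.90]
error_counts = [0, 1, 2, 3, 4]

print("Kendall tau-b:", kendall_tau_b(metric_penalty, error_counts))  # 0.8
print("Pearson r:", round(pearson_r(metric_penalty, error_counts), 3))
```

Tau-b is used here because integer error counts tend to contain ties. The abstract does not say which tau variant the paper reports.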

Related papers