Fugu-MT Paper Translation (Abstract): Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

Paper overview: Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

arxiv url: http://arxiv.org/abs/2604.24710v1
Date: Mon, 27 Apr 2026 17:17:56 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:08.263733
Title: Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters
Title (reference translation): Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement (823 Encounters)
Authors: Aaryan Shah, Andrew Hines, Alexia Downs, Denis Bajet, Paulius Mui, Fabiano Araujo, Laura Offutt, Aida Rutledge, Elizabeth Jimenez,
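Abstract summary: Methods that require expert review for every scoring instance are too slow and too expensive for safe, iterative deployment. Twenty clinicians authored 1,646 rubrics for 823 clinical cases across primary care, psychiatry, oncology, and behavioral health. Case-specific rubrics offer a path for clinical AI evaluation that preserves expert judgment while enabling automation at a cost three orders of magnitude lower.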
Reference score (independently calculated attention score): 3.018184429993625
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Objective. Clinical AI documentation systems require evaluation methodologies that are clinically valid, economically viable, and sensitive to iterative changes. Methods requiring expert review per scoring instance are too slow and expensive for safe, iterative deployment. We present a case-specific, clinician-authored rubric methodology for clinical AI evaluation and examine whether LLM-generated rubrics can approximate clinician agreement. Materials and Methods. Twenty clinicians authored 1,646 rubrics for 823 clinical cases (736 real-world, 87 synthetic) across primary care, psychiatry, oncology, and behavioral health. Each rubric was validated by confirming that an LLM-based scoring agent consistently scored clinician-preferred outputs higher than rejected ones. Seven versions of an EHR-embedded AI agent for clinicians were evaluated across all cases. Results. Clinician-authored rubrics discriminated effectively between high- and low-quality outputs (median score gap: 82.9%) with high scoring stability (median range: 0.00%). Median scores improved from 84% to 95%. In later experiments, clinician-LLM ranking agreement (tau: 0.42-0.46) matched or exceeded clinician-clinician agreement (tau: 0.38-0.43), attributable to both ceiling compression and LLM rubric improvement. Discussion. This convergence supports incorporating LLM rubrics alongside clinician-authored ones. At roughly 1,000 times lower cost, LLM rubrics enable substantially greater evaluation coverage, while continued clinical authorship grounds evaluation in expert judgment. Ceiling compression poses a methodological challenge for future inter-rater agreement studies. Conclusion. Case-specific rubrics offer a path for clinical AI evaluation that preserves expert judgment while enabling automation at three orders lower cost. Clinician-authored rubrics establish the baseline against which LLM rubrics are validated.
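Abstract (reference translation): Objective. Clinical AI documentation systems need evaluation methods that are clinically valid, economically viable, and sensitive to iterative changes. Methods that require expert review for every scoring instance are too slow and too costly for safe, iterative deployment. We propose a case-specific, clinician-authored rubric method for clinical AI evaluation and examine whether LLM-generated rubrics can approximate clinician agreement. Materials and Methods. Twenty clinicians authored 1,646 rubrics for 823 clinical cases (736 real-world, 87 synthetic) spanning primary care, psychiatry, oncology, and behavioral health. Each rubric was validated by confirming that an LLM-based scoring agent consistently scored clinician-preferred outputs higher than rejected ones. Seven versions of an EHR-embedded AI agent for clinicians were evaluated on all cases. Results. Clinician-authored rubrics discriminated effectively between high- and low-quality outputs (median score gap: 82.9%) with high scoring stability (median range: 0.00%). Median scores improved from 84% to 95%. In later experiments, clinician-LLM ranking agreement (tau: 0.42-0.46) matched or exceeded clinician-clinician agreement (tau: 0.38-0.43), which is attributable to both ceiling compression and improvement in the LLM rubrics. Discussion. This convergence supports incorporating LLM rubrics alongside clinician-authored ones. At roughly 1,000 times lower cost, LLM rubrics allow substantially greater evaluation coverage, while continued clinical authorship keeps evaluation grounded in expert judgment. Ceiling compression is a methodological challenge for future inter-rater agreement studies. Conclusion. Case-specific rubrics offer a path for clinical AI evaluation that preserves expert judgment while enabling automation at a cost three orders of magnitude lower. Clinician-authored rubrics establish the baseline against which LLM rubrics are validated.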
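
The abstract reports two kinds of quantities: a per-rubric validation check (score gap and scoring stability) and ranking agreement measured with tau. The Python sketch below illustrates how such quantities could be computed. It is not the authors' implementation. The function names, the use of Kendall tau-a, and the exact definitions of "gap" (difference of medians), "range" (max minus min over repeated runs of the preferred output), and the pass criterion are all assumptions made for illustration; the abstract does not specify them.

```python
from itertools import combinations
from statistics import median


def kendall_tau_a(scores_a, scores_b):
    """Kendall tau-a between two equal-length lists of ranks or scores.

    Assumption: tau-a (tied pairs count as neither concordant nor
    discordant). The abstract does not say which tau variant was used.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


def validate_rubric(preferred_runs, rejected_runs):
    """Summarize repeated scoring runs for one rubric (hypothetical helper).

    preferred_runs / rejected_runs: scores (in percent) that a scoring agent
    gave the clinician-preferred and the rejected output over repeated runs.
    Returns (gap, stability_range, passes).
    """
    # Assumed definition: gap is the difference between median scores.
    gap = median(preferred_runs) - median(rejected_runs)
    # Assumed definition: stability is the spread of the preferred scores.
    stability_range = max(preferred_runs) - min(preferred_runs)
    # Assumed criterion: every preferred run outscores every rejected run.
    passes = min(preferred_runs) > max(rejected_runs)
    return gap, stability_range, passes


if __name__ == "__main__":
    # Made-up rankings of five outputs by a clinician and by an LLM rubric.
    print(kendall_tau_a([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))
    # Made-up repeated scores for one rubric.
    print(validate_rubric([90, 90, 90], [10, 20, 10]))
```

The example inputs are invented and carry no relation to the study's data. A tie-corrected variant (tau-b) would be the more usual choice when many outputs share the same score, which is the situation the abstract describes as ceiling compression.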

Related papers