Fugu-MT Paper Translation (Abstract): A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework

Paper overview: A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework

arxiv url: http://arxiv.org/abs/2604.25933v1
Date: Fri, 03 Apr 2026 14:50:01 GMT
Status: Translation complete
Last updated in system: 2026-05-04 02:32:14.231179
Title: A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework
Title (reference translation): LLM-as-a-Judge in Healthcare and the MedJUDGE Framework
Authors: Chenyu Li, Zohaib Akhtar, Mingu Kwak, Yuelyu Ji, Hang Zhang, Tracey Obi, Yufan Ren, Xizhi Wu, Sonish Sivarajkumar, Harold P. Lehmann, Shyam Visweswaran, Michael J. Becich, Danielle L. Mowery, Renxuan Liu, Haoyang Sun, Yanshan Wang
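Abstract summary: LLM-as-a-Judge (LaaJ) uses large language models to evaluate model outputs. Despite growing adoption, validation rigor was limited. Risk of bias testing was absent in 36 studies (73.5%), only 1 study (2.0%) examined demographic fairness, and none assessed temporal stability or patient context. The paper proposes MedJUDGE, a risk-stratified three-pillar framework organized around validity, safety, and accountability across clinical risk tiers.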
Reference score (independently computed attention score): 11.502207790112344
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) increasingly generate and process clinical text, scalable evaluation has become critical. LLM-as-a-Judge (LaaJ), which uses LLMs to evaluate model outputs, offers a scalable alternative to costly expert review, but its healthcare adoption raises safety and bias concerns. We conducted a PRISMA-ScR scoping review of six databases (January 2020-January 2026), screening 11,727 studies and including 49. The landscape was dominated by evaluation and benchmarking applications (n=37, 75.5%), pointwise scoring (n=42, 85.7%), and GPT-family judges (n=36, 73.5%). Despite growing adoption, validation rigor was limited: among 36 studies with human involvement, the median number of expert validators was 3, while 13 (26.5%) used none. Risk of bias testing was absent in 36 studies (73.5%), only 1 (2.0%) examined demographic fairness, and none assessed temporal stability or patient context. Deployment remained limited, with 1 study (2.0%) reaching production and four (8.2%) prototype stage. Importantly, these gaps may interact: when judges and evaluated systems share training data or architectures, they may inherit similar blind spots, and agreement metrics may fail to distinguish true validity from shared errors. Minimal human oversight, limited bias assessment, and model monoculture together represent a governance gap where current validation may miss clinically significant errors. To address this, we propose MedJUDGE (Medical Judge Utility, De-biasing, Governance and Evaluation), a risk-stratified three-pillar framework organized around validity, safety, and accountability across clinical risk tiers, providing deployment-oriented evaluation guidance for healthcare LaaJ systems.
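Abstract (reference translation): As large language models (LLMs) generate and process clinical text, scalable evaluation has become important. LLM-as-a-Judge (LaaJ) uses LLMs to evaluate model outputs and offers a scalable alternative to costly expert review. A PRISMA-ScR scoping review of six databases (January 2020 to January 2026) screened 11,727 studies and included 49. The landscape was dominated by evaluation and benchmarking applications (n=37, 75.5%), pointwise scoring (n=42, 85.7%), and GPT-family judges (n=36, 73.5%). Despite growing adoption, validation rigor was limited: among the 36 studies with human involvement, the median number of expert validators was 3, and 13 studies (26.5%) used none. Risk of bias testing was absent in 36 studies (73.5%), only 1 study (2.0%) examined demographic fairness, and none assessed temporal stability or patient context. Deployment was limited, with 1 study (2.0%) reaching production and 4 (8.2%) reaching the prototype stage. When judges and evaluated systems share training data or architectures, they may inherit similar blind spots. Minimal human oversight, limited bias assessment, and model monoculture together represent a governance gap in which current validation may miss clinically significant errors. The authors therefore propose MedJUDGE (Medical Judge Utility, De-biasing, Governance and Evaluation), which provides deployment-oriented evaluation guidance for healthcare LaaJ systems.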
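The abstract notes that agreement metrics may fail to distinguish true validity from shared errors when judges share blind spots. The following is a minimal Python sketch of that idea, not a method taken from the paper: the labels are hypothetical toy data, and the names `expert`, `judge_a`, `judge_b`, `cohen_kappa`, and `accuracy` are assumptions introduced only for illustration.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def accuracy(pred, truth):
    """Fraction of predictions that match the reference labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


# Hypothetical toy data: 1 = output judged acceptable, 0 = unacceptable.
expert = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]   # assumed expert reference labels
judge_a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]  # LLM judge that misses three bad outputs
judge_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]  # second judge with the same blind spot

kappa_ab = cohen_kappa(judge_a, judge_b)
kappa_a_expert = cohen_kappa(judge_a, expert)
acc_a = accuracy(judge_a, expert)

print(f"judge A vs judge B kappa: {kappa_ab:.2f}")
print(f"judge A vs expert kappa:  {kappa_a_expert:.2f}")
print(f"judge A accuracy:         {acc_a:.2f}")
```

In this toy setup the two judges agree perfectly with each other, yet judge A matches the expert labels on only 7 of 10 items. Inter-judge agreement alone would therefore overstate validity, which is one reason an expert reference set may still be needed.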


Related papers