Fugu-MT paper translation (abstract): How to Evaluate Medical AI

Paper abstract: How to Evaluate Medical AI

arxiv url: http://arxiv.org/abs/2509.11941v2
Date: Thu, 25 Sep 2025 09:31:04 GMT
Status: translation complete
Updated in system: 2025-09-26 14:16:56.068207
Title: How to Evaluate Medical AI
Authors: Ilia Kopanichuk, Petr Anokhin, Vladimir Shaposhnikov, Vladimir Makharev, Ekaterina Tsapieva, Iaroslav Bespalov, Dmitry V. Dylov, Ivan Oseledets
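Abstract summary: The paper introduces Relative Precision and Recall of Algorithmic Diagnostics (RPAD and RRAD). RPAD and RRAD compare AI outputs against multiple expert opinions rather than a single reference. A large-scale study shows that top-performing models, such as DeepSeek-V3, achieve consistency on par with or exceeding expert consensus.
Reference score (attention measure computed by this site): 4.23552814358972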
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The integration of artificial intelligence (AI) into medical diagnostic workflows requires robust and consistent evaluation methods that ensure reliability and clinical relevance and that account for the inherent variability in expert judgments. Traditional metrics like precision and recall often fail to account for this variability, leading to inconsistent assessments of AI performance. Inter-rater agreement statistics like Cohen's Kappa are more reliable, but they lack interpretability. We introduce Relative Precision and Recall of Algorithmic Diagnostics (RPAD and RRAD), new evaluation metrics that compare AI outputs against multiple expert opinions rather than a single reference. By normalizing performance against inter-expert disagreement, these metrics provide a more stable and realistic measure of the quality of a predicted diagnosis. In addition to the comprehensive analysis of diagnostic quality measures, our study yields an important side result. Our evaluation methodology avoids selecting diagnoses from a limited list when evaluating a given case. Instead, both the models being tested and the examiners verifying them arrive at a free-form diagnosis. With this automated methodology for establishing the identity of free-form clinical diagnoses, 98% accuracy becomes attainable. We evaluate our approach using 360 medical dialogues, comparing multiple large language models (LLMs) against a panel of physicians. This large-scale study shows that top-performing models, such as DeepSeek-V3, achieve consistency on par with or exceeding expert consensus. Moreover, we demonstrate that expert judgments exhibit significant variability, often greater than that between AI and humans. This finding underscores the limitations of any absolute metrics and supports the need to adopt relative metrics in medical AI.
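The abstract describes the relative metrics only in words: AI outputs are compared against several experts, and the result is normalized by how much the experts disagree with one another. The Python sketch below illustrates one way that idea could be computed. It is an assumption for illustration, not the paper's definition of RPAD or RRAD. The function names, the set-overlap form of precision and recall, the averaging over experts, and the example diagnoses are all hypothetical, and the free-form diagnoses are assumed to have already been mapped to canonical labels.

```python
from itertools import permutations
from statistics import mean


def precision(predicted: set[str], reference: set[str]) -> float:
    """Fraction of predicted diagnoses that also appear in the reference."""
    return len(predicted & reference) / len(predicted) if predicted else 0.0


def recall(predicted: set[str], reference: set[str]) -> float:
    """Fraction of reference diagnoses that the prediction recovers."""
    return len(predicted & reference) / len(reference) if reference else 0.0


def relative_score(metric, ai: set[str], experts: list[set[str]]) -> float:
    """AI-vs-expert score divided by an expert-vs-expert baseline.

    Hypothetical normalization, not taken from the paper.
    Requires at least two experts.
    """
    # Average agreement of the AI with each expert taken as the reference.
    ai_vs_experts = mean(metric(ai, e) for e in experts)
    # Baseline: every expert scored against every other expert.
    expert_vs_expert = mean(metric(a, b) for a, b in permutations(experts, 2))
    return ai_vs_experts / expert_vs_expert if expert_vs_expert else float("nan")


# Made-up example: one case, three experts, labels already normalized.
experts = [{"flu", "bronchitis"}, {"flu"}, {"flu", "pneumonia"}]
ai = {"flu", "bronchitis"}

rel_precision = relative_score(precision, ai, experts)
rel_recall = relative_score(recall, ai, experts)
print(f"relative precision: {rel_precision:.2f}")
print(f"relative recall:    {rel_recall:.2f}")
```

Under this sketch, a value near 1 means the AI agrees with the experts about as well as the experts agree with each other, and a value above 1 means it agrees with them more. The exact normalization used in the paper may differ.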

Related papers