Fugu-MT paper translation (abstract): Beyond Overall Accuracy: A Psychometric Deep Dive into the Topic-Specific Medical Capabilities of 80 Large Language Models

Paper overview: Beyond Overall Accuracy: A Psychometric Deep Dive into the Topic-Specific Medical Capabilities of 80 Large Language Models

arxiv url: http://arxiv.org/abs/2509.24186v1
Date: Mon, 29 Sep 2025 02:06:13 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.686854
Title: Beyond Overall Accuracy: A Psychometric Deep Dive into the Topic-Specific Medical Capabilities of 80 Large Language Models
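Title (reference translation): Beyond Overall Accuracy: A Psychometric Deep Dive into the Topic-Specific Medical Capabilities of 80 Large Language Models
Authors: Zhimeng Luo, Lixin Wu, Adam Frisch, Daqing He
Abstract summary: We introduce MedIRT, a rigorous evaluation framework grounded in Item Response Theory (IRT). Fresh responses were prospectively gathered from 80 diverse large language models (LLMs) on a balanced, 1,100-question USMLE-aligned benchmark. Each LLM's latent ability is estimated jointly with question difficulty and discrimination, yielding more stable and nuanced performance rankings than accuracy alone.
Reference score (independently computed attention score): 6.362188639024662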
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As Large Language Models (LLMs) are increasingly proposed for high-stakes medical applications, there is a critical need for reliable and accurate evaluation methodologies. Traditional accuracy metrics are inadequate because they neither capture question characteristics nor offer topic-specific insights. To address this gap, we introduce MedIRT, a rigorous evaluation framework grounded in Item Response Theory (IRT), the gold standard in high-stakes educational testing. Unlike previous research relying on archival data, we prospectively gathered fresh responses from 80 diverse LLMs on a balanced, 1,100-question USMLE-aligned benchmark. Using one unidimensional two-parameter logistic IRT model per topic, we estimate each LLM's latent ability jointly with question difficulty and discrimination, yielding more stable and nuanced performance rankings than accuracy alone. Notably, we identify distinctive "spiky" ability profiles, where overall rankings can be misleading due to highly specialized model abilities. While GPT-5 was the top performer in a majority of domains (8 of 11), it was outperformed in Social Science and Communication by Claude-3-opus, demonstrating that even an overall 23rd-ranked model can hold the top spot for specific competencies. Furthermore, we demonstrate IRT's utility in auditing benchmarks by identifying flawed questions. We synthesize these findings into a practical decision-support framework that integrates our multi-factor competency profiles with operational metrics. This work establishes a robust, psychometrically grounded methodology essential for the safe, effective, and trustworthy deployment of LLMs in healthcare.
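Abstract (reference translation): Because large language models (LLMs) are increasingly proposed for high-stakes medical applications, the need for reliable and accurate evaluation methods is growing. Conventional accuracy metrics are insufficient, since they neither capture the characteristics of questions nor provide topic-specific insight. To address this gap, we introduce MedIRT, a rigorous evaluation framework based on Item Response Theory (IRT), the gold standard in high-stakes educational testing. Unlike earlier work that relies on archival data, we collected new responses from 80 diverse LLMs on a balanced, 1,100-question USMLE-aligned benchmark. Using one unidimensional two-parameter logistic IRT model per topic, we estimate each LLM's latent ability together with question difficulty and discrimination, obtaining performance rankings that are more stable and nuanced than accuracy alone. In particular, we identify distinctive "spiky" ability profiles, in which overall rankings can mislead because model abilities are highly specialized. GPT-5 was the top performer in most domains (8 of 11), but Claude-3-opus outperformed it in Social Science and Communication, showing that even a model ranked 23rd overall can hold the top position for a specific competency. We further demonstrate the usefulness of IRT for auditing benchmarks by identifying flawed questions. We combine these findings into a practical decision-support framework that integrates our multi-factor competency profiles with operational metrics. This work establishes a robust, psychometrically grounded methodology that is essential for the safe, effective, and trustworthy deployment of LLMs in healthcare.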
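
The abstract describes fitting a unidimensional two-parameter logistic (2PL) IRT model per topic, in which the probability that a model with ability theta answers a question correctly depends on the question's discrimination a and difficulty b. The following is a minimal Python sketch of that idea, not the authors' implementation: the fitting procedure (penalized joint maximum likelihood by gradient ascent), the function names, the `prior_weight` parameter, and the 0.3 discrimination cutoff for flagging questions are all assumptions made for illustration.

```python
import numpy as np


def p_correct(theta, a, b):
    """2PL item response function: P(correct) = sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


def fit_2pl(responses, n_iter=1500, lr=0.1, prior_weight=1.0):
    """Jointly estimate abilities and item parameters for one topic.

    responses: (n_models, n_items) array of 0/1 correctness.
    Uses gradient ascent on the log-likelihood with weak Gaussian
    penalties (theta toward 0, a toward 1, b toward 0). The penalties
    are an assumption of this sketch; they fix the scale of theta and
    keep estimates finite for all-correct or all-wrong rows and columns.
    """
    responses = np.asarray(responses, dtype=float)
    n_models, n_items = responses.shape
    theta = np.zeros(n_models)  # latent ability per model
    a = np.ones(n_items)        # discrimination per question
    b = np.zeros(n_items)       # difficulty per question
    for _ in range(n_iter):
        p = p_correct(theta[:, None], a[None, :], b[None, :])
        resid = responses - p   # gradient of log-likelihood w.r.t. the logit
        grad_theta = (resid * a[None, :]).sum(axis=1) - theta
        grad_a = (resid * (theta[:, None] - b[None, :])).sum(axis=0)
        grad_a -= prior_weight * (a - 1.0)
        grad_b = -(resid * a[None, :]).sum(axis=0) - prior_weight * b
        theta += lr * grad_theta / n_items
        a += lr * grad_a / n_models
        b += lr * grad_b / n_models
    return theta, a, b


def flag_low_discrimination(a, threshold=0.3):
    """Indices of questions whose estimated discrimination is below threshold.

    The threshold is a hypothetical audit cutoff, not a value from the paper.
    """
    return np.flatnonzero(np.asarray(a) < threshold)


def simulate(n_models=80, n_items=100, seed=0):
    """Draw synthetic 0/1 responses from a known 2PL model (for checking the fit)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=n_models)
    a = rng.uniform(0.5, 2.0, size=n_items)
    b = rng.normal(size=n_items)
    p = p_correct(theta[:, None], a[None, :], b[None, :])
    responses = (rng.random(p.shape) < p).astype(float)
    return responses, theta, a, b


if __name__ == "__main__":
    responses, true_theta, _, _ = simulate()
    theta_hat, a_hat, b_hat = fit_2pl(responses)
    print("ability recovery (correlation):", np.corrcoef(theta_hat, true_theta)[0, 1])
    print("questions flagged for review:", flag_low_discrimination(a_hat).tolist())
```

Running one such fit per topic gives a per-topic ability vector for each model, which is the kind of profile the abstract calls "spiky" when a model ranks very differently across topics. Questions with near-zero or negative estimated discrimination are those that stronger models do not answer more reliably than weaker ones, which is one plausible signal for the benchmark auditing the abstract mentions.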

Related papers