Fugu-MT Paper Translation (Abstract): Assessing LLMs' Performance: Insights from the Chinese Pharmacist Exam

Paper summary: Assessing LLMs' Performance: Insights from the Chinese Pharmacist Exam

arxiv url: http://arxiv.org/abs/2511.20526v1
Date: Tue, 25 Nov 2025 17:31:25 GMT
Status: Translation complete
Last updated in system: 2025-11-26 17:37:04.578435
Title: Assessing LLMs' Performance: Insights from the Chinese Pharmacist Exam
Title (reference translation): Evaluating LLM Performance: Insights from the Chinese Pharmacist Exam
Authors: Xinran Wang, Boran Zhu, Shujuan Zhou, Ziwen Long, Dehua Zhou, Shu Zhang
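Abstract summary: In China, the national pharmacist exam serves as a standardized benchmark for evaluating pharmacists' clinical and theoretical competencies. This study aimed to compare the performance of two large language models, ChatGPT-4o and DeepSeek-R1.
Reference score (independently computed attention score): 9.07457306513003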
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Background: As large language models (LLMs) become increasingly integrated into digital health education and assessment workflows, their capabilities in supporting high-stakes, domain-specific certification tasks remain underexplored. In China, the national pharmacist licensure exam serves as a standardized benchmark for evaluating pharmacists' clinical and theoretical competencies. Objective: This study aimed to compare the performance of two LLMs: ChatGPT-4o and DeepSeek-R1 on real questions from the Chinese Pharmacist Licensing Examination (2017-2021), and to discuss the implications of these performance differences for AI-enabled formative evaluation. Methods: A total of 2,306 multiple-choice (text-only) questions were compiled from official exams, training materials, and public databases. Questions containing tables or images were excluded. Each item was input in its original Chinese format, and model responses were evaluated for exact accuracy. Pearson's Chi-squared test was used to compare overall performance, and Fisher's exact test was applied to year-wise multiple-choice accuracy. Results: DeepSeek-R1 outperformed ChatGPT-4o with a significantly higher overall accuracy (90.0% vs. 76.1%, p < 0.001). Unit-level analyses revealed consistent advantages for DeepSeek-R1, particularly in foundational and clinical synthesis modules. While year-by-year multiple-choice performance also favored DeepSeek-R1, this performance gap did not reach statistical significance in any specific unit-year (all p > 0.05). Conclusion: DeepSeek-R1 demonstrated robust alignment with the structural and semantic demands of the pharmacist licensure exam. These findings suggest that domain-specific models warrant further investigation for this context, while also reinforcing the necessity of human oversight in legally and ethically sensitive contexts.
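Abstract (reference translation): Background: As large language models (LLMs) are integrated into digital health education and assessment workflows, their ability to support high-stakes, domain-specific certification tasks remains underexplored. In China, the national pharmacist licensure exam functions as a standardized benchmark for evaluating pharmacists' clinical and theoretical competencies. Objective: This study set out to compare two LLMs, ChatGPT-4o and DeepSeek-R1, on real questions from the Chinese Pharmacist Licensing Examination (2017-2021), and to discuss what the performance differences imply for AI-enabled formative evaluation. Methods: A total of 2,306 text-only multiple-choice questions were compiled from official exams, training materials, and public databases. Questions containing tables or images were excluded. Each item was entered in its original Chinese form, and model responses were scored for exact accuracy. Pearson's chi-squared test was used to compare overall performance, and Fisher's exact test was applied to year-wise multiple-choice accuracy. Results: DeepSeek-R1 outperformed ChatGPT-4o in overall accuracy (90.0% vs. 76.1%, p < 0.001). Unit-level analyses showed a consistent advantage for DeepSeek-R1, particularly in the foundational and clinical synthesis modules. Year-by-year multiple-choice performance also favored DeepSeek-R1, but the gap was not statistically significant in any specific unit-year (all p > 0.05). Conclusion: DeepSeek-R1 showed robust alignment with the structural and semantic demands of the pharmacist licensure exam. These findings suggest that domain-specific models warrant further investigation in this context, and they reinforce the need for human oversight in legally and ethically sensitive contexts.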
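
Illustration (not from the paper): the overall comparison in the abstract can be sanity-checked with a small Pearson chi-squared sketch. The abstract reports only percentages, so the correct-answer counts below are an assumption, reconstructed by rounding 90.0% and 76.1% of 2,306; the sketch also assumes both models answered all 2,306 questions and that no continuity correction was applied. The function name chi2_2x2 is hypothetical.

```python
import math

def chi2_2x2(a_correct, a_total, b_correct, b_total):
    """Pearson chi-squared test (no continuity correction) on a 2x2
    correct/incorrect table for two models. Returns (statistic, p_value)."""
    table = [
        [a_correct, a_total - a_correct],
        [b_correct, b_total - b_correct],
    ]
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_sums)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

N_QUESTIONS = 2306  # from the abstract

# ASSUMPTION: counts are rebuilt from the reported percentages,
# not taken from the paper's tables.
deepseek_correct = round(0.900 * N_QUESTIONS)
chatgpt_correct = round(0.761 * N_QUESTIONS)

chi2, p_value = chi2_2x2(deepseek_correct, N_QUESTIONS,
                         chatgpt_correct, N_QUESTIONS)
print(f"chi2 = {chi2:.1f}, p < 0.001: {p_value < 0.001}")
```

Under these assumed counts the p-value falls far below 0.001, which is consistent with the significance level reported in the abstract. The year-wise Fisher's exact tests cannot be reproduced the same way, because the abstract gives no per-year counts.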

Related papers