Fugu-MT Paper Translation (Abstract): PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation

Paper overview: PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation

arxiv url: http://arxiv.org/abs/2509.11517v1
Date: Mon, 15 Sep 2025 02:07:26 GMT
Status: Translation complete
Last updated in system: 2025-09-16 17:26:23.122633
Title: PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation
Title (reference translation): PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation
Authors: Rodrigo M. Carrillo-Larco, Jesus Lovón Melgarejo, Manuel Castillo-Cara, Gusseppe Bravo-Rocca
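Abstract summary: AIMS: to build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training. We curated PeruMedQA, a multiple-choice question-answering dataset containing 8,380 questions spanning 12 medical domains. medgemma-27b-text-it outperformed all other models, achieving a proportion of correct answers exceeding 90% in several instances.
Reference score (independently computed attention score): 0.6899744489931012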
License: http://creativecommons.org/licenses/by/4.0/
Abstract: BACKGROUND: Medical large language models (LLMs) have demonstrated remarkable performance in answering medical examinations. However, the extent to which this high performance is transferable to medical questions in Spanish and from a Latin American country remains unexplored. This knowledge is crucial as LLM-based medical applications gain traction in Latin America. AIMS: to build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training; to fine-tune an LLM on this dataset; to evaluate and compare the performance in terms of accuracy between vanilla LLMs and the fine-tuned LLM. METHODS: We curated PeruMedQA, a multiple-choice question-answering (MCQA) dataset containing 8,380 questions spanning 12 medical domains (2018-2025). We selected eight medical LLMs including medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot task-specific prompts to answer the questions appropriately. We employed parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) to fine-tune medgemma-4b-it utilizing all questions except those from 2025 (test set). RESULTS: medgemma-27b-text-it outperformed all other models, achieving a proportion of correct answers exceeding 90% in several instances. LLMs with <10 billion parameters achieved <60% correct answers, while some exams yielded results <50%. The fine-tuned version of medgemma-4b-it emerged victorious against all LLMs with <10 billion parameters and rivaled an LLM with 70 billion parameters across various examinations. CONCLUSIONS: For medical AI applications and research that require knowledge bases from Spanish-speaking countries and those exhibiting similar epidemiological profiles to Peru's, interested parties should utilize medgemma-27b-text-it or a fine-tuned version of medgemma-4b-it.
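Abstract (reference translation): BACKGROUND: Medical large language models (LLMs) have shown remarkable performance in answering medical examinations. However, how far this high performance carries over to medical questions in Spanish and from a Latin American country remains unexplored. As LLM-based medical applications gain momentum in Latin America, this knowledge is essential. AIMS: to build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training, to fine-tune an LLM on this dataset, and to evaluate and compare the accuracy of vanilla LLMs and the fine-tuned LLM. METHODS: We curated PeruMedQA, an MCQA dataset containing 8,380 questions spanning 12 medical domains (2018-2025). We selected eight medical LLMs, including medgemma-4b-it and medgemma-27b-text-it, and created zero-shot task-specific prompts. We applied parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) to fine-tune medgemma-4b-it on all questions except those from 2025 (test set). RESULTS: medgemma-27b-text-it outperformed all other models, achieving a proportion of correct answers exceeding 90% in several cases. LLMs with fewer than 10 billion parameters scored below 60% correct answers, and some exams yielded results below 50%. The fine-tuned version of medgemma-4b-it outperformed all LLMs with fewer than 10 billion parameters and rivaled an LLM with 70 billion parameters across various examinations. CONCLUSIONS: For medical AI applications and research that require knowledge bases from Spanish-speaking countries and from countries with epidemiological profiles similar to Peru's, interested parties should use medgemma-27b-text-it or a fine-tuned version of medgemma-4b-it.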
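
Illustrative sketch (not from the paper): the abstract describes holding out the 2025 questions as a test set and reporting the proportion of correct answers per exam. The Python below shows one way such a split and per-exam accuracy could be computed. The record fields ("id", "year", "exam", "answer"), the exam names, and the predictions are all hypothetical and do not reflect the actual PeruMedQA schema or any reported result.

```python
from collections import defaultdict

# Hypothetical records; field names and exam names are assumptions,
# not the actual PeruMedQA schema.
questions = [
    {"id": 1, "year": 2023, "exam": "cardiology", "answer": "A"},
    {"id": 2, "year": 2025, "exam": "cardiology", "answer": "C"},
    {"id": 3, "year": 2025, "exam": "cardiology", "answer": "B"},
    {"id": 4, "year": 2025, "exam": "pediatrics", "answer": "D"},
]


def split_by_year(records, test_year=2025):
    """Hold out every question from test_year; the rest is for fine-tuning."""
    train = [r for r in records if r["year"] != test_year]
    test = [r for r in records if r["year"] == test_year]
    return train, test


def accuracy_by_exam(records, predictions):
    """Return the proportion of correct answers for each exam.

    A question with no prediction counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["exam"]] += 1
        if predictions.get(r["id"]) == r["answer"]:
            correct[r["exam"]] += 1
    return {exam: correct[exam] / total[exam] for exam in total}


train_set, test_set = split_by_year(questions)

# Hypothetical model outputs keyed by question id (made up for illustration).
predictions = {2: "C", 3: "A", 4: "D"}
scores = accuracy_by_exam(test_set, predictions)
print(scores)
```

Splitting by year rather than at random keeps the most recent exam entirely unseen during fine-tuning, which is the setup the abstract describes.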

Related papers