PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation
- URL: http://arxiv.org/abs/2509.11517v1
- Date: Mon, 15 Sep 2025 02:07:26 GMT
- Title: PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation
- Authors: Rodrigo M. Carrillo-Larco, Jesus Lovón Melgarejo, Manuel Castillo-Cara, Gusseppe Bravo-Rocca,
- Abstract summary: AIMS: to build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training. We curated PeruMedQA, a multiple-choice question-answering dataset containing 8,380 questions spanning 12 medical domains. Medgemma-27b-text-it outperformed all other models, achieving a proportion of correct answers exceeding 90% in several instances.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: BACKGROUND: Medical large language models (LLMs) have demonstrated remarkable performance in answering medical examinations. However, the extent to which this high performance transfers to medical questions in Spanish and from a Latin American country remains unexplored. This knowledge is crucial as LLM-based medical applications gain traction in Latin America. AIMS: to build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training; to fine-tune an LLM on this dataset; to evaluate and compare the performance in terms of accuracy between vanilla LLMs and the fine-tuned LLM. METHODS: We curated PeruMedQA, a multiple-choice question-answering (MCQA) dataset containing 8,380 questions spanning 12 medical domains (2018-2025). We selected eight medical LLMs, including medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot task-specific prompts to answer the questions appropriately. We employed parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) to fine-tune medgemma-4b-it using all questions except those from 2025 (test set). RESULTS: medgemma-27b-text-it outperformed all other models, achieving a proportion of correct answers exceeding 90% in several instances. LLMs with <10 billion parameters answered <60% of questions correctly, and some exams yielded results <50%. The fine-tuned version of medgemma-4b-it emerged victorious against all LLMs with <10 billion parameters and rivaled an LLM with 70 billion parameters across various examinations. CONCLUSIONS: For medical AI applications and research that require knowledge bases from Spanish-speaking countries, or from countries with epidemiological profiles similar to Peru's, interested parties should utilize medgemma-27b-text-it or a fine-tuned version of medgemma-4b-it.
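The methods describe two techniques: zero-shot task-specific prompting for the MCQA task, and PEFT/LoRA fine-tuning of medgemma-4b-it on the pre-2025 questions. A minimal sketch of how such a pipeline might look with Hugging Face transformers and peft is shown below; the prompt wording, dataset fields, LoRA hyperparameters, and the text-only loading path for medgemma-4b-it are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: zero-shot MCQA prompting and LoRA-based PEFT for medgemma-4b-it.
# Model id, prompt wording, and LoRA settings are assumptions for illustration only;
# the multimodal medgemma-4b-it checkpoint may require a different model class.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_ID = "google/medgemma-4b-it"  # assumed Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def build_prompt(question: str, options: list[str]) -> str:
    """Zero-shot, task-specific prompt: ask only for the letter of the correct option."""
    letters = "ABCDE"
    opts = "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(options))
    return (
        "Eres un médico peruano rindiendo un examen de residentado.\n"
        f"Pregunta: {question}\n{opts}\n"
        "Responde únicamente con la letra de la alternativa correcta."
    )

@torch.no_grad()
def answer(question: str, options: list[str]) -> str:
    """Greedy decoding of a short completion, keeping only the newly generated tokens."""
    inputs = tokenizer(build_prompt(question, options), return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=4, do_sample=False)
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

# Parameter-efficient fine-tuning (PEFT): attach low-rank (LoRA) adapters to the
# attention projections and train only those, leaving the base weights frozen.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed target modules
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_cfg)
peft_model.print_trainable_parameters()
# Supervised fine-tuning on the 2018-2024 questions (holding out 2025 as the test set)
# would then proceed with a standard trainer, e.g. trl's SFTTrainer.
```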
Related papers
- Enabling Doctor-Centric Medical AI with LLMs through Workflow-Aligned Tasks and Benchmarks [72.89088985703748]
The rise of large language models (LLMs) has transformed healthcare by offering clinical guidance, yet their direct deployment to patients poses safety risks. We propose repositioning LLMs as clinical assistants that collaborate with experienced physicians rather than interacting with patients directly. We construct DoctorFLAN, a large-scale Chinese medical dataset comprising 92,000 Q&A instances across 22 clinical tasks and 27 specialties.
arXiv Detail & Related papers (2025-10-13T06:18:27Z) - MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian [50.767415194856135]
We introduce MedQARo, the first large-scale medical QA benchmark in Romanian. We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients.
arXiv Detail & Related papers (2025-08-22T13:48:37Z) - MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering [29.289606699835293]
Large Language Models (LLMs) have demonstrated remarkable performance on various medical question-answering (QA) benchmarks. We introduce the MedPAIR dataset to evaluate how physician trainees and LLMs prioritize relevant information when answering QA questions.
arXiv Detail & Related papers (2025-05-29T22:23:48Z) - AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset [8.521691388707799]
We introduce AfriMed-QA, the first large-scale Pan-African English multiple-choice medical Question-Answering dataset. 15,000 questions were sourced from over 60 medical schools across 16 countries, covering 32 medical specialties. We find that biomedical LLMs underperform general models and that smaller edge-friendly LLMs struggle to achieve a passing score.
arXiv Detail & Related papers (2024-11-23T19:43:02Z) - OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM [48.16696073640864]
We introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark.
All images in this benchmark are sourced from authentic medical scenarios.
We have found that existing LVLMs struggle to address these medical VQA problems effectively.
arXiv Detail & Related papers (2024-02-14T13:51:56Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark comprises several key components, including the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z) - ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for Chinese medical domain.
ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF.
We analyze possible biases through prompting ChiMed-GPT to perform attitude scales regarding discrimination of patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z) - Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT). LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules. We found that medical textbooks as a retrieval corpus are a more effective knowledge database than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset [31.047827145874844]
We introduce CMExam, sourced from the Chinese National Medical Licensing Examination.
CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner.
For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels.
arXiv Detail & Related papers (2023-06-05T16:48:41Z) - Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z) - Towards Expert-Level Medical Question Answering with Large Language Models [16.882775912583355]
Large language models (LLMs) have catalyzed significant progress in medical question answering.
Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base model improvements (PaLM 2), medical domain fine-tuning, and prompting strategies.
We also observed performance approaching or exceeding the state of the art across the MedMCQA, PubMedQA, and MMLU clinical topics datasets.
arXiv Detail & Related papers (2023-05-16T17:11:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.