KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations
- URL: http://arxiv.org/abs/2403.01469v3
- Date: Mon, 09 Dec 2024 06:52:13 GMT
- Title: KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations
- Authors: Sunjun Kweon, Byungjin Choi, Gyouk Chu, Junyeong Song, Daeun Hyeon, Sujin Gan, Jueon Kim, Minkyu Kim, Rae Woong Park, Edward Choi
- Abstract summary: We present KorMedMCQA, the first Korean Medical Multiple-Choice Question Answering benchmark. The dataset contains 7,469 questions from examinations for doctors, nurses, pharmacists, and dentists.
- Score: 7.8387874506025215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present KorMedMCQA, the first Korean Medical Multiple-Choice Question Answering benchmark, derived from professional healthcare licensing examinations conducted in Korea between 2012 and 2024. The dataset contains 7,469 questions from examinations for doctors, nurses, pharmacists, and dentists, covering a wide range of medical disciplines. We evaluate the performance of 59 large language models, spanning proprietary and open-source models, multilingual and Korean-specialized models, and those fine-tuned for clinical applications. Our results show that applying Chain of Thought (CoT) reasoning can enhance model performance by up to 4.5% compared to direct answering approaches. We also investigate whether MedQA, one of the most widely used medical benchmarks derived from the U.S. Medical Licensing Examination, can serve as a reliable proxy for evaluating model performance in other regions, in this case Korea. Our correlation analysis between model scores on KorMedMCQA and MedQA reveals that these two benchmarks align no better than benchmarks from entirely different domains (e.g., MedQA and MMLU-Pro). This finding underscores the substantial linguistic and clinical differences between Korean and U.S. medical contexts, reinforcing the need for region-specific medical QA benchmarks. To support ongoing research in Korean healthcare AI, we publicly release KorMedMCQA via Hugging Face.
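The two quantitative claims above (the CoT-versus-direct comparison and the KorMedMCQA/MedQA correlation analysis) can be reproduced in outline. The following is a minimal sketch, not the authors' released evaluation code: the Hugging Face dataset identifier, config name, split name, and column names are assumptions to be checked against the actual release, the accuracy lists are placeholder values rather than the paper's results, and Spearman rank correlation is used as one illustrative choice since the abstract does not name a specific correlation statistic.

```python
# Minimal sketch (not the authors' evaluation harness).
# Assumed: dataset id "sean0042/KorMedMCQA", config "doctor", split "test",
# and columns "question", "A".."E", "answer". Verify against the actual release.
from datasets import load_dataset
from scipy.stats import spearmanr

ds = load_dataset("sean0042/KorMedMCQA", "doctor", split="test")  # id/config assumed

def build_prompt(example: dict, use_cot: bool = False) -> str:
    """Format one 5-option multiple-choice question, optionally eliciting CoT."""
    options = "\n".join(f"{letter}. {example[letter]}" for letter in "ABCDE")
    instruction = (
        "Think step by step, then state the final answer as a single letter."
        if use_cot
        else "Answer with a single letter only."
    )
    return f"{example['question']}\n{options}\n{instruction}"

direct_prompt = build_prompt(ds[0], use_cot=False)
cot_prompt = build_prompt(ds[0], use_cot=True)

# Cross-benchmark correlation: rank correlation between per-model accuracies
# on KorMedMCQA and MedQA (placeholder numbers below, not the paper's results).
kormedmcqa_acc = [0.62, 0.71, 0.55, 0.48, 0.80]
medqa_acc      = [0.66, 0.69, 0.60, 0.52, 0.78]
rho, p_value = spearmanr(kormedmcqa_acc, medqa_acc)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

In practice each prompt would be sent to every model under both settings, accuracy computed per exam subset, and the resulting per-model scores fed into the correlation step.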
Related papers
- KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination [16.50828571559655]
KorMedMCQA-V is a Korean medical licensing-exam-style multimodal multiple-choice question answering benchmark. The dataset consists of 1,534 questions with 2,043 associated images from Korean Medical Licensing Examinations.
arXiv Detail & Related papers (2026-02-14T07:42:04Z) - Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models [57.73472878679636]
We introduce Med-RewardBench, the first benchmark specifically designed to evaluate medical reward models and judges. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions.
arXiv Detail & Related papers (2025-08-29T08:58:39Z) - MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian [50.767415194856135]
We introduce MedQARo, the first large-scale medical QA benchmark in Romanian. We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients.
arXiv Detail & Related papers (2025-08-22T13:48:37Z) - KokushiMD-10: Benchmark for Evaluating Large Language Models on Ten Japanese National Healthcare Licensing Examinations [6.453078564406654]
KokushiMD-10 is the first multimodal benchmark constructed from ten Japanese national healthcare licensing exams. This benchmark spans multiple fields, including Medicine, Dentistry, Nursing, Pharmacy, and allied health professions. It contains over 11,588 real exam questions, incorporating clinical images and expert-annotated rationales to evaluate both textual and visual reasoning.
arXiv Detail & Related papers (2025-06-09T02:26:02Z) - LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation [58.25892575437433]
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. We present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios.
arXiv Detail & Related papers (2025-06-04T15:43:14Z) - Polish Medical Exams: A new dataset for cross-lingual medical knowledge transfer assessment [0.865489625605814]
This study introduces a novel benchmark dataset based on Polish medical licensing and specialization exams.
It comprises over 24,000 exam questions, including a subset of parallel Polish-English corpora.
We evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and Polish-specific models, and compare their performance against human medical students.
arXiv Detail & Related papers (2024-11-30T19:02:34Z) - AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset [8.521691388707799]
We introduce AfriMed-QA, the first large-scale Pan-African English multi-choice medical Question-Answering dataset.
15,000 questions were sourced from over 60 medical schools across 16 countries, covering 32 medical specialties.
We find that biomedical LLMs underperform general models and smaller edge-friendly LLMs struggle to achieve a passing score.
arXiv Detail & Related papers (2024-11-23T19:43:02Z) - A Benchmark for Long-Form Medical Question Answering [4.815957808858573]
There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA).
Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions.
In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors.
arXiv Detail & Related papers (2024-11-14T22:54:38Z) - CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z) - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
MedS-Ins, a companion instruction-tuning collection, comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models [55.215061531495984]
"MedBench" is a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLM.
MedBench assembles the largest evaluation dataset (300,901 questions) to cover 43 clinical specialties.
MedBench also implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization.
arXiv Detail & Related papers (2024-06-24T02:25:48Z) - Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams [32.77551245372691]
Existing benchmarks for evaluating Large Language Models (LLMs) in healthcare predominantly focus on medical doctors.
We introduce the Examinations for Medical Personnel in Chinese (EMPEC), a pioneering large-scale healthcare knowledge benchmark in traditional Chinese.
EMPEC consists of 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented occupations like Optometrists and Audiologists.
arXiv Detail & Related papers (2024-06-17T08:40:36Z) - Pattern Recognition or Medical Knowledge? The Problem with Multiple-Choice Questions in Medicine [3.471944921180245]
Large Language Models (LLMs) demonstrate significant potential in the medical domain. They are often evaluated using multiple-choice questions (MCQs) modeled on exams like the USMLE. We created a fictional medical benchmark centered on an imaginary organ, the Glianorex, allowing us to separate memorized knowledge from reasoning ability.
arXiv Detail & Related papers (2024-06-04T15:08:56Z) - MedConceptsQA: Open Source Medical Concepts QA Benchmark [0.07083082555458872]
We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering.
The benchmark comprises questions on various medical concepts across different vocabularies: diagnoses, procedures, and drugs.
We conducted evaluations of the benchmark using various Large Language Models.
arXiv Detail & Related papers (2024-05-12T17:54:50Z) - MediFact at MEDIQA-M3G 2024: Medical Question Answering in Dermatology with Multimodal Learning [0.0]
This paper addresses the limitations of traditional methods by proposing a weakly supervised learning approach for open-ended medical question answering (QA).
Our system leverages readily available MEDIQA-M3G images via a VGG16-CNN-SVM model (a minimal sketch of this pattern follows this summary), enabling multilingual learning of informative skin condition representations.
This work advances medical QA research, paving the way for clinical decision support systems and ultimately improving healthcare delivery.
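The VGG16-CNN-SVM pattern named above can be read as a frozen VGG16 backbone used as a fixed feature extractor feeding a linear SVM classifier. The following is a minimal illustration of that pattern, not the MediFact system itself: the image paths, labels, and preprocessing choices are placeholders.

```python
# Minimal sketch of "VGG16 features -> SVM classifier" (not the MediFact code).
# Image paths and labels below are placeholders.
import numpy as np
import torch
from torchvision import models, transforms
from PIL import Image
from sklearn.svm import LinearSVC

# Frozen VGG16 backbone; drop the final classification layer to get 4096-d features.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image_paths):
    """Run each image through the frozen backbone and stack the feature vectors."""
    feats = []
    with torch.no_grad():
        for path in image_paths:
            img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            feats.append(vgg(img).squeeze(0).numpy())
    return np.stack(feats)

# Placeholder data: at least two classes are needed to fit the SVM.
train_paths = ["case_001.jpg", "case_002.jpg"]
train_labels = ["eczema", "psoriasis"]
clf = LinearSVC().fit(extract_features(train_paths), train_labels)
```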
arXiv Detail & Related papers (2024-04-27T20:03:47Z) - Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build a benchmark, ClinicBench, to better understand large language models (LLMs) in the clinic.
We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks.
We then construct six novel datasets and clinical tasks that are complex but common in real-world practice.
We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z) - BiMediX: Bilingual Medical Mixture of Experts LLM [94.85518237963535]
We introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English and Arabic.
Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details.
We propose a semi-automated English-to-Arabic translation pipeline with human refinement to ensure high-quality translations.
arXiv Detail & Related papers (2024-02-20T18:59:26Z) - Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [59.60384461302662]
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs).
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor, as the player, and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of several key components, including the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z) - ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for the Chinese medical domain.
ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF.
We analyze possible biases through prompting ChiMed-GPT to perform attitude scales regarding discrimination of patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z) - Towards Expert-Level Medical Question Answering with Large Language Models [16.882775912583355]
Large language models (LLMs) have catalyzed significant progress in medical question answering.
Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain fine-tuning, and prompting strategies.
We also observed performance approaching or exceeding the state of the art across the MedMCQA, PubMedQA, and MMLU clinical topics datasets.
arXiv Detail & Related papers (2023-05-16T17:11:29Z) - GPT-4 can pass the Korean National Licensing Examination for Korean Medicine Doctors [9.374652839580182]
This study assessed the capabilities of GPT-4 in traditional Korean medicine (TKM).
We optimized prompts with Chinese-term annotation, English translation of questions and instructions, exam-optimized instructions, and self-consistency (a minimal majority-vote sketch follows this entry).
GPT-4 with optimized prompts achieved 66.18% accuracy, surpassing the examination's average pass mark of 60% and the 40% minimum for each subject.
arXiv Detail & Related papers (2023-03-31T05:43:21Z)
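The self-consistency step mentioned in the GPT-4/TKM entry above amounts to sampling several reasoning paths and taking a majority vote over the parsed answers. A minimal sketch, assuming a hypothetical ask_model function (not part of that study's code) that returns one sampled answer letter per call:

```python
# Minimal sketch of self-consistency as majority voting over sampled answers.
# ask_model is a hypothetical stand-in for a temperature-sampled LLM call
# that returns a parsed answer letter such as "A".."E".
from collections import Counter

def ask_model(prompt: str) -> str:
    """Placeholder: call the model with sampling enabled and parse out the answer letter."""
    raise NotImplementedError

def self_consistent_answer(prompt: str, n_samples: int = 5) -> str:
    """Sample n independent answers and return the most frequent one."""
    votes = Counter(ask_model(prompt) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```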
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.