KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination
- URL: http://arxiv.org/abs/2602.13650v1
- Date: Sat, 14 Feb 2026 07:42:04 GMT
- Title: KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination
- Authors: Byungjin Choi, Seongsu Bae, Sunjun Kweon, Edward Choi
- Abstract summary: KorMedMCQA-V is a Korean medical licensing-exam-style multimodal multiple-choice question answering benchmark. The dataset consists of 1,534 questions with 2,043 associated images from Korean Medical Licensing Examinations.
- Score: 16.50828571559655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce KorMedMCQA-V, a Korean medical licensing-exam-style multimodal multiple-choice question answering benchmark for evaluating vision-language models (VLMs). The dataset consists of 1,534 questions with 2,043 associated images from Korean Medical Licensing Examinations (2012-2023); about 30% of questions contain multiple images that require cross-image evidence integration. Images cover clinical modalities including X-ray, computed tomography (CT), electrocardiography (ECG), ultrasound, endoscopy, and other medical visuals. We benchmark over 50 VLMs across proprietary and open-source categories, spanning general-purpose, medical-specialized, and Korean-specialized families, under a unified zero-shot evaluation protocol. The best proprietary model (Gemini-3.0-Pro) achieves 96.9% accuracy, the best open-source model (Qwen3-VL-32B-Thinking) reaches 83.7%, and the best Korean-specialized model (VARCO-VISION-2.0-14B) only 43.2%. We further find that reasoning-oriented model variants gain up to +20 percentage points over instruction-tuned counterparts, that medical domain specialization yields inconsistent gains over strong general-purpose baselines, that all models degrade on multi-image questions, and that performance varies notably across imaging modalities. By complementing the text-only KorMedMCQA benchmark, KorMedMCQA-V forms a unified evaluation suite for Korean medical reasoning across text-only and multimodal conditions. The dataset is available via Hugging Face Datasets: https://huggingface.co/datasets/seongsubae/KorMedMCQA-V.
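Since the benchmark is distributed through Hugging Face Datasets, a typical evaluation loop would load it with the `datasets` library and format each item as a zero-shot multiple-choice prompt. The sketch below is illustrative only: the split name and the column names (`question`, `options`, `images`) are assumptions, not the published schema, so check the dataset card before use.

```python
# Minimal sketch: load KorMedMCQA-V and build a zero-shot MCQA prompt.
# The split and column names are assumed, not taken from the dataset card.
from datasets import load_dataset

ds = load_dataset("seongsubae/KorMedMCQA-V", split="test")  # assumed split

LETTERS = "ABCDE"

def build_prompt(example: dict) -> str:
    """Render one question as a zero-shot multiple-choice prompt."""
    choices = "\n".join(
        f"{LETTERS[i]}. {opt}" for i, opt in enumerate(example["options"])
    )
    return (
        f"{example['question']}\n{choices}\n"
        "Answer with the letter of the single best option."
    )

example = ds[0]
prompt = build_prompt(example)
images = example["images"]  # about 30% of questions attach multiple images
```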
Related papers
- Enabling Ultra-Fast Cardiovascular Imaging Across Heterogeneous Clinical Environments with a Generalist Foundation Model and Multimodal Database [64.65360708629485]
MMCMR-427K is the largest and most comprehensive multimodal cardiovascular magnetic resonance k-space database. CardioMM is a reconstruction foundation model capable of adapting to heterogeneous fast CMR imaging scenarios. CardioMM unifies semantic contextual understanding with physics-informed data consistency to deliver robust reconstructions.
arXiv Detail & Related papers (2025-12-25T12:47:50Z)
- CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging [2.9857131541387827]
We introduce CrossMed, a benchmark to evaluate compositional generalization (CG) in medical vision-language models. We reformulate four public datasets into a unified visual question answering (VQA) format, resulting in 20,200 multiple-choice QA instances. Models trained on Related splits achieve 83.2 percent classification accuracy and 0.75 segmentation cIoU, while performance drops significantly under Unrelated and zero-overlap conditions.
arXiv Detail & Related papers (2025-11-14T07:41:01Z)
- Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding [112.46150793476603]
We introduce Hulu-Med, a transparent, generalist medical Vision-Language Model (VLM). Hulu-Med is trained on a curated corpus of 16.7 million samples, spanning 12 major anatomical systems and 14 medical imaging modalities. It surpasses existing open-source models on 27 of 30 benchmarks and outperforms proprietary systems such as GPT-4o on 16 benchmarks.
arXiv Detail & Related papers (2025-10-09T17:06:42Z)
- TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models [54.48710348910535]
Existing medical reasoning benchmarks primarily focus on analyzing a patient's condition based on an image from a single visit. We introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits.
arXiv Detail & Related papers (2025-09-29T17:51:26Z)
- MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian [50.767415194856135]
We introduce MedQARo, the first large-scale medical QA benchmark in Romanian. We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients.
arXiv Detail & Related papers (2025-08-22T13:48:37Z)
- MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning [24.9872402922819]
Existing medical VQA benchmarks mostly focus on single-image analysis. We introduce MedFrameQA, the first benchmark that explicitly evaluates multi-image reasoning in medical VQA.
arXiv Detail & Related papers (2025-05-22T17:46:11Z)
- A Lightweight Large Vision-language Model for Multimodal Medical Images [0.06990493129893112]
Medical Visual Question Answering (VQA) enhances clinical decision-making by enabling systems to interpret medical images and answer clinical queries. We introduce a lightweight, multimodal VQA model integrating BiomedCLIP for image feature extraction and LLaMA-3 for text processing. Our results show 73.4% accuracy for open-ended questions, surpassing existing models and validating its potential for real-world medical applications.
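As a rough illustration of the BiomedCLIP-plus-LLM pattern this summary describes (not the paper's actual implementation), the sketch below extracts an image embedding with the publicly released BiomedCLIP checkpoint and projects it into a hypothetical LLM embedding space; the projection layer, its 4096-dimension target, and the input file are assumptions.

```python
# Sketch of the general BiomedCLIP-as-image-encoder pattern; the projection
# into the LLM is a hypothetical, untrained stand-in.
import torch
import open_clip
from PIL import Image

MODEL_ID = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = open_clip.create_model_from_pretrained(MODEL_ID)

image = preprocess(Image.open("chest_xray.png")).unsqueeze(0)  # assumed file
with torch.no_grad():
    feats = model.encode_image(image)  # (1, 512) BiomedCLIP image embedding

# Hypothetical linear projection into an LLM token-embedding space
# (4096-d, e.g. LLaMA-3-8B); in a real system this layer is trained jointly.
proj = torch.nn.Linear(feats.shape[-1], 4096)
visual_token = proj(feats)
```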
arXiv Detail & Related papers (2025-04-08T00:19:48Z)
- KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations [7.8387874506025215]
We present KorMedMCQA, the first Korean Medical Multiple-Choice Question Answering benchmark. The dataset contains 7,469 questions drawn from doctor, nurse, pharmacist, and dentist licensing examinations.
arXiv Detail & Related papers (2024-03-03T10:31:49Z)
- PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and ImageCLEF 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z)
- Generalist Vision Foundation Models for Medical Imaging: A Case Study of Segment Anything Model on Zero-Shot Medical Segmentation [5.547422331445511]
We report quantitative and qualitative zero-shot segmentation results on nine medical image segmentation benchmarks.
Our study indicates the versatility of generalist vision foundation models in medical imaging.
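For context, zero-shot prompting with the public `segment_anything` package looks roughly like the sketch below; the checkpoint path, input image, and point prompt are placeholder assumptions, and the paper's exact evaluation protocol is not reproduced here.

```python
# Sketch: zero-shot point-prompted segmentation with Segment Anything.
# Checkpoint path, image file, and prompt coordinates are placeholders.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("ct_slice.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),  # one assumed foreground click
    point_labels=np.array([1]),           # 1 = foreground
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate
```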
arXiv Detail & Related papers (2023-04-25T08:07:59Z)
- Malignancy Prediction and Lesion Identification from Clinical Dermatological Images [65.1629311281062]
We consider machine-learning-based malignancy prediction and lesion identification from clinical dermatological images.
The model first identifies all lesions present in the image regardless of sub-type or likelihood of malignancy, then estimates each lesion's likelihood of malignancy, and, through aggregation, generates an image-level likelihood of malignancy (one plausible aggregation rule is sketched after this list).
arXiv Detail & Related papers (2021-04-02T20:52:05Z)
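The abstract above does not specify how per-lesion scores are aggregated; a noisy-OR combination is one plausible rule, sketched below purely for illustration.

```python
# Illustrative noisy-OR aggregation of per-lesion malignancy probabilities;
# the paper's actual aggregation rule may differ.
import numpy as np

def image_level_malignancy(lesion_probs: np.ndarray) -> float:
    """Image is malignant if at least one lesion is (noisy-OR)."""
    return float(1.0 - np.prod(1.0 - lesion_probs))

print(image_level_malignancy(np.array([0.1, 0.3, 0.05])))  # ~0.4015
```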