Polish Medical Exams: A new dataset for cross-lingual medical knowledge transfer assessment
- URL: http://arxiv.org/abs/2412.00559v1
- Date: Sat, 30 Nov 2024 19:02:34 GMT
- Title: Polish Medical Exams: A new dataset for cross-lingual medical knowledge transfer assessment
- Authors: Łukasz Grzybowski, Jakub Pokrywka, Michał Ciesiółka, Jeremi I. Kaczmarek, Marek Kubis,
- Abstract summary: This study introduces a novel benchmark dataset based on Polish medical licensing and specialization exams.
It comprises over 24,000 exam questions, including a subset of parallel Polish-English corpora.
We evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and Polish-specific models, and compare their performance against human medical students.
- Abstract: Large Language Models (LLMs) have demonstrated significant potential in handling specialized tasks, including medical problem-solving. However, most studies predominantly focus on English-language contexts. This study introduces a novel benchmark dataset based on Polish medical licensing and specialization exams (LEK, LDEK, PES) taken by medical doctor candidates and practicing doctors pursuing specialization. The dataset was web-scraped from publicly available resources provided by the Medical Examination Center and the Chief Medical Chamber. It comprises over 24,000 exam questions, including a subset of parallel Polish-English corpora, where the English portion was professionally translated by the examination center for foreign candidates. By creating a structured benchmark from these existing exam questions, we systematically evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and Polish-specific models, and compare their performance against human medical students. Our analysis reveals that while models like GPT-4o achieve near-human performance, significant challenges persist in cross-lingual translation and domain-specific understanding. These findings underscore disparities in model performance across languages and medical specialties, highlighting the limitations and ethical considerations of deploying LLMs in clinical practice.
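The evaluation setting described in the abstract is standard multiple-choice accuracy: the model is shown an exam question with its answer options, must pick the single best option, and its accuracy is compared against human candidates. The sketch below illustrates that scoring loop only as an assumption about how such a benchmark could be run; the question format, field names, and the `ask_model` callable are illustrative placeholders, not the authors' actual harness.

```python
import re
from typing import Callable

def score_exam(questions: list[dict], ask_model: Callable[[str], str]) -> float:
    """Return accuracy of `ask_model` on single-best-answer exam questions.

    Each question dict is assumed (hypothetically) to look like:
      {"question": str, "options": {"A": str, ...}, "answer": "A"}
    `ask_model` maps a prompt string to the model's raw text reply.
    """
    correct = 0
    for q in questions:
        # Render the options in a fixed A/B/C/... order and build the prompt.
        options = "\n".join(f"{key}. {text}" for key, text in sorted(q["options"].items()))
        prompt = (
            f"{q['question']}\n{options}\n"
            "Answer with the letter of the single best option."
        )
        reply = ask_model(prompt)
        # Take the first standalone option letter in the reply as the model's choice.
        match = re.search(r"\b([A-E])\b", reply.upper())
        if match and match.group(1) == q["answer"]:
            correct += 1
    return correct / len(questions)

# Tiny smoke test with a dummy model that always answers "A".
if __name__ == "__main__":
    sample = [{
        "question": "Which vitamin deficiency causes scurvy?",
        "options": {"A": "Vitamin C", "B": "Vitamin D", "C": "Vitamin K", "D": "Vitamin B12"},
        "answer": "A",
    }]
    print(score_exam(sample, lambda prompt: "A"))  # -> 1.0
```

In practice the dummy `lambda` would be replaced by a call to whichever LLM API is being benchmarked, and the resulting accuracy compared with the exam's published pass thresholds for human candidates.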
Related papers
- A Survey of Medical Vision-and-Language Applications and Their Techniques [48.268198631277315]
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data.
Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied.
We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
arXiv Detail & Related papers (2024-11-19T03:27:05Z) - CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z) - Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams [32.77551245372691]
Existing benchmarks for evaluating Large Language Models (LLMs) in healthcare predominantly focus on medical doctors.
We introduce the Examinations for Medical Personnel in Chinese (EMPEC), a pioneering large-scale healthcare knowledge benchmark in traditional Chinese.
EMPEC consists of 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented occupations like Optometrists and Audiologists.
arXiv Detail & Related papers (2024-06-17T08:40:36Z) - Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build a benchmark ClinicBench to better understand large language models (LLMs) in the clinic.
We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks.
We then construct six novel datasets and clinical tasks that are complex but common in real-world practice.
We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z) - DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain [8.246368441549967]
We present the first-ever publicly available French biomedical language understanding benchmark called DrBenchmark.
It encompasses 20 diversified tasks, including named-entity recognition, part-of-speech tagging, question-answering, semantic textual similarity, and classification.
We evaluate 8 state-of-the-art masked language models (MLMs) pre-trained on general and biomedical-specific data, as well as English-specific models, to assess their cross-lingual capabilities.
arXiv Detail & Related papers (2024-02-20T23:54:02Z) - A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [57.88111980149541]
We introduce Asclepius, a novel Med-MLLM benchmark that assesses Med-MLLMs in terms of distinct medical specialties and different diagnostic capacities.
Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in several key findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z) - Explanatory Argument Extraction of Correct Answers in Resident Medical Exams [5.399800035598185]
We present a new dataset which includes not only explanatory arguments for the correct answer, but also arguments to reason why the incorrect answers are not correct.
This new benchmark allows us to set up a novel extractive task that consists of identifying the explanation of the correct answer written by medical doctors.
arXiv Detail & Related papers (2023-12-01T13:22:35Z) - Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding [12.128991867050487]
Large language models (LLMs) have made significant progress in various domains, including healthcare.
In this study, we evaluate state-of-the-art LLMs within the realm of clinical language understanding tasks.
arXiv Detail & Related papers (2023-04-09T16:31:47Z) - Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.