Related papers: MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

URL: http://arxiv.org/abs/2507.20917v1
Date: Mon, 28 Jul 2025 15:17:48 GMT
Title: MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation
Authors: Adrien Bazoge,
Abstract summary: MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects.<n>The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer.
Score: 0.7770029179741429
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models' cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models' performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.

Related papers

MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian [50.767415194856135]
We introduce MedQARo, the first large-scale medical QA benchmark in Romanian.<n>We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients.
arXiv Detail & Related papers (2025-08-22T13:48:37Z)
GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning [50.94508930739623]
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images.<n>Current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers.<n>This work first proposes a Thinking with Visual Grounding dataset wherein the answer generation is decomposed into intermediate reasoning steps.<n>We introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer.
arXiv Detail & Related papers (2025-06-22T08:09:58Z)
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation [58.25892575437433]
evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error.<n>We present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios.
arXiv Detail & Related papers (2025-06-04T15:43:14Z)
A Survey of Medical Vision-and-Language Applications and Their Techniques [48.268198631277315]
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data. Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied. We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
arXiv Detail & Related papers (2024-11-19T03:27:05Z)
Pattern Recognition or Medical Knowledge? The Problem with Multiple-Choice Questions in Medicine [3.471944921180245]
Large Language Models (LLMs) demonstrate significant potential in the medical domain.<n>They are often evaluated using multiple-choice questions (MCQs) modeled on exams like the USMLE.<n>We created a fictional medical benchmark centered on an imaginary organ, the Glianorex, allowing us to separate memorized knowledge from reasoning ability.
arXiv Detail & Related papers (2024-06-04T15:08:56Z)
MedConceptsQA: Open Source Medical Concepts QA Benchmark [0.07083082555458872]
We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering. The benchmark comprises of questions of various medical concepts across different vocabularies: diagnoses, procedures, and drugs. We conducted evaluations of the benchmark using various Large Language Models.
arXiv Detail & Related papers (2024-05-12T17:54:50Z)
Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build a benchmark ClinicBench to better understand large language models (LLMs) in the clinic. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. We then construct six novel datasets and clinical tasks that are complex but common in real-world practice. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z)
MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale [19.94415334436024]
We devise a semi-automated annotation process to streamline data preparation and build new benchmark MedVQA datasets. These datasets provide intermediate medical decision-making rationales generated by multimodal large language models and human annotations. We also design a novel framework, MedThink, which finetunes lightweight pretrained generative models by incorporating medical decision-making rationales.
arXiv Detail & Related papers (2024-04-18T17:53:19Z)
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions [19.436999992810797]
We construct two new datasets: JAMA Clinical Challenge and Medbullets.<n> JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions.<n>We evaluate seven LLMs on the two datasets using various prompts.
arXiv Detail & Related papers (2024-02-28T05:44:41Z)
Explanatory Argument Extraction of Correct Answers in Resident Medical Exams [5.399800035598185]
We present a new dataset which includes not only explanatory arguments for the correct answer, but also arguments to reason why the incorrect answers are not correct. This new benchmark allows us to setup a novel extractive task which consists of identifying the explanation of the correct answer written by medical doctors.
arXiv Detail & Related papers (2023-12-01T13:22:35Z)
ExpertQA: Expert-Curated Questions and Attributed Answers [51.68314045809179]
We conduct human evaluation of responses from a few representative systems along various axes of attribution and factuality. We collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions. The output of our analysis is ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.
arXiv Detail & Related papers (2023-09-14T16:54:34Z)
PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding. LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge. We describe the procedure for building a powerful, open-source language model specifically designed for medicine applications, termed as PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z)
FrenchMedMCQA: A French Multiple-Choice Question Answering Dataset for Medical domain [4.989459243399296]
This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy.
arXiv Detail & Related papers (2023-04-09T16:57:40Z)
Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex Healthcare Question Answering [89.76059961309453]
HeadQA dataset contains multiple-choice questions authorized for the public healthcare specialization exam. These questions are the most challenging for current QA systems. We present a Multi-step reasoning with Knowledge extraction framework (MurKe) We are striving to make full use of off-the-shelf pre-trained models.
arXiv Detail & Related papers (2020-08-06T02:47:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.