IMB: An Italian Medical Benchmark for Question Answering
- URL: http://arxiv.org/abs/2510.18468v1
- Date: Tue, 21 Oct 2025 09:45:59 GMT
- Title: IMB: An Italian Medical Benchmark for Question Answering
- Authors: Antonio Romano, Giuseppe Riccio, Mariano Barone, Marco Postiglione, Vincenzo Moscato
- Abstract summary: We present two comprehensive Italian medical benchmarks: \textbf{IMB-QA}, containing 782,644 patient-doctor conversations from 77 medical categories, and \textbf{IMB-MCQA}, comprising 25,862 multiple-choice questions from medical specialty examinations. We demonstrate how Large Language Models (LLMs) can be leveraged to improve the clarity and consistency of medical forum data while retaining the original meaning and conversational style. Our experiments with Retrieval Augmented Generation (RAG) and domain-specific fine-tuning reveal that specialized adaptation strategies can outperform larger, general-purpose models in medical question answering tasks.
- Score: 11.555285143713315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online medical forums have long served as vital platforms where patients seek professional healthcare advice, generating vast amounts of valuable knowledge. However, the informal nature and linguistic complexity of forum interactions pose significant challenges for automated question answering systems, especially when dealing with non-English languages. We present two comprehensive Italian medical benchmarks: \textbf{IMB-QA}, containing 782,644 patient-doctor conversations from 77 medical categories, and \textbf{IMB-MCQA}, comprising 25,862 multiple-choice questions from medical specialty examinations. We demonstrate how Large Language Models (LLMs) can be leveraged to improve the clarity and consistency of medical forum data while retaining their original meaning and conversational style, and compare a variety of LLM architectures on both open and multiple-choice question answering tasks. Our experiments with Retrieval Augmented Generation (RAG) and domain-specific fine-tuning reveal that specialized adaptation strategies can outperform larger, general-purpose models in medical question answering tasks. These findings suggest that effective medical AI systems may benefit more from domain expertise and efficient information retrieval than from increased model scale. We release both datasets and evaluation frameworks in our GitHub repository to support further research on multilingual medical question answering: https://github.com/PRAISELab-PicusLab/IMB.
Related papers
- MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers [35.41469674626373]
We introduce MedPT, the first large-scale, real-world corpus for Brazilian Portuguese. It comprises 384,095 authentic question-answer pairs from patient-doctor interactions. Our analysis reveals its thematic breadth (3,200 topics) and unique linguistic properties, such as the natural asymmetry in patient-doctor communication.
arXiv Detail & Related papers (2025-11-14T21:13:28Z) - MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian [50.767415194856135]
We introduce MedQARo, the first large-scale medical QA benchmark in Romanian. We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients.
arXiv Detail & Related papers (2025-08-22T13:48:37Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models [18.6994780408699]
Large Language Models (LLMs) face significant challenges in medical question answering. We propose a novel approach incorporating similar case generation within a multi-agent medical question-answering system. Our method capitalizes on the model's inherent medical knowledge and reasoning capabilities, eliminating the need for additional training data.
arXiv Detail & Related papers (2024-12-31T19:55:45Z) - A Survey of Medical Vision-and-Language Applications and Their Techniques [48.268198631277315]
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data.
Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied.
We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
arXiv Detail & Related papers (2024-11-19T03:27:05Z) - Two-Layer Retrieval-Augmented Generation Framework for Low-Resource Medical Question Answering Using Reddit Data: Proof-of-Concept Study [4.769236554995528]
We propose a retrieval-augmented generation architecture for medical question answering on emerging issues associated with health-related topics. Our framework generates individual summaries followed by an aggregated summary to answer medical queries from large amounts of user-generated social media data. Our framework achieves comparable median scores in terms of relevance, length, hallucination, coverage, and coherence when evaluated using GPT-4 and Nous-Hermes-2-7B-DPO.
arXiv Detail & Related papers (2024-05-29T20:56:52Z) - Large Language Models for Multi-Choice Question Classification of Medical Subjects [0.2020207586732771]
We train deep neural networks for multi-class classification of questions into the inferred medical subjects.
We show the capability of AI, and LLMs in particular, for multi-class classification tasks in the healthcare domain.
arXiv Detail & Related papers (2024-03-21T17:36:08Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - Developing ChatGPT for Biology and Medicine: A Complete Review of Biomedical Question Answering [25.569980942498347]
ChatGPT offers a strategic blueprint for question answering (QA) in delivering medical diagnoses, treatment recommendations, and other healthcare support.
This is achieved through the increasing incorporation of medical domain data via natural language processing (NLP) and multimodal paradigms.
arXiv Detail & Related papers (2024-01-15T07:21:16Z) - Explanatory Argument Extraction of Correct Answers in Resident Medical Exams [5.399800035598185]
We present a new dataset which includes not only explanatory arguments for the correct answer, but also arguments to reason why the incorrect answers are not correct.
This new benchmark allows us to setup a novel extractive task which consists of identifying the explanation of the correct answer written by medical doctors.
arXiv Detail & Related papers (2023-12-01T13:22:35Z) - PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medical applications, termed PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z) - MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware Medical Dialogue Generation [86.38736781043109]
We build and release MedDG, a large-scale, high-quality medical dialogue dataset covering 12 types of common gastrointestinal diseases.
We propose two medical dialogue tasks based on the MedDG dataset: next-entity prediction and doctor response generation.
Experimental results show that pre-trained language models and other baselines struggle on both tasks, achieving poor performance on our dataset.
arXiv Detail & Related papers (2020-10-15T03:34:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.