PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language
- URL: http://arxiv.org/abs/2505.18331v1
- Date: Fri, 23 May 2025 19:39:01 GMT
- Title: PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language
- Authors: Naghmeh Jamali, Milad Mohammadi, Danial Baledi, Zahra Rezvani, Hesham Faili
- Abstract summary: PerMedCQA is the first Persian-language benchmark for evaluating large language models on medical consumer question answering. We evaluate several state-of-the-art multilingual and instruction-tuned LLMs using MedJudge, a novel rubric-based evaluation framework driven by an LLM grader. Our results highlight key challenges in multilingual medical QA and provide valuable insights for developing more accurate and context-aware medical assistance systems.
- Score: 0.1747623282473278
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Medical consumer question answering (CQA) is crucial for empowering patients by providing personalized and reliable health information. Despite recent advances in large language models (LLMs) for medical QA, consumer-oriented and multilingual resources, particularly in low-resource languages like Persian, remain sparse. To bridge this gap, we present PerMedCQA, the first Persian-language benchmark for evaluating LLMs on real-world, consumer-generated medical questions. Curated from a large medical QA forum, PerMedCQA contains 68,138 question-answer pairs, refined through careful data cleaning from an initial set of 87,780 raw entries. We evaluate several state-of-the-art multilingual and instruction-tuned LLMs, utilizing MedJudge, a novel rubric-based evaluation framework driven by an LLM grader, validated against expert human annotators. Our results highlight key challenges in multilingual medical QA and provide valuable insights for developing more accurate and context-aware medical assistance systems. The data is publicly available on https://huggingface.co/datasets/NaghmehAI/PerMedCQA
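The abstract describes MedJudge only at a high level: a rubric-based framework in which an LLM grader scores model answers and is validated against expert human annotators. As an illustration, a grading loop of this general shape could be used; the rubric criteria, prompt wording, and function names below are assumptions for the sketch, not the paper's actual implementation.

```python
import re

# Hypothetical rubric; the paper's actual MedJudge criteria are not specified here.
RUBRIC = (
    "Score the candidate answer against the reference on a 1-5 scale for each "
    "criterion: accuracy, completeness, and safety. "
    "Reply in the form 'accuracy: X, completeness: Y, safety: Z'."
)

def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Assemble a grading prompt for an LLM judge."""
    return (
        f"{RUBRIC}\n\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
    )

def parse_scores(judge_reply: str) -> dict:
    """Extract per-criterion integer scores from the judge's reply."""
    found = re.findall(r"(accuracy|completeness|safety):\s*(\d)", judge_reply)
    return {criterion: int(score) for criterion, score in found}

def grade(question: str, reference: str, candidate: str, llm) -> dict:
    """Grade one QA pair; `llm` is any callable mapping a prompt to reply text."""
    reply = llm(build_judge_prompt(question, reference, candidate))
    scores = parse_scores(reply)
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores
```

Decoupling the judge from a concrete API behind the `llm` callable makes it straightforward to validate the automatic grader against human annotators, as the paper reports doing: the same rubric prompt can be shown to experts and their scores compared with the LLM's.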
Related papers
- MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks [7.822971505079421]
This study introduces MedArabiQ, a novel benchmark dataset consisting of seven Arabic medical tasks. We first constructed the dataset using past medical exams and publicly available datasets. We then introduced different modifications to evaluate various LLM capabilities, including bias mitigation.
arXiv Detail & Related papers (2025-05-06T11:07:26Z) - MKG-Rank: Enhancing Large Language Models with Knowledge Graph for Multilingual Medical Question Answering [32.60615474034456]
We propose Multilingual Knowledge Graph-based Retrieval Ranking (MKG-Rank) for multilingual medical question answering. Our framework integrates comprehensive English-centric medical knowledge graphs into LLM reasoning at a low cost. Extensive evaluations on multilingual medical QA benchmarks across Chinese, Japanese, Korean, and Swahili demonstrate that MKG-Rank consistently outperforms zero-shot LLMs.
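The blurb above describes retrieving facts from English-centric knowledge graphs to support non-English questions. As a minimal sketch of the ranking step only, triples could be ordered by lexical overlap with the (translated) query terms; the `rank_facts` helper and the overlap scoring are illustrative assumptions, not MKG-Rank's actual ranking model.

```python
# Illustrative re-ranking of English knowledge-graph triples against a set of
# English query terms (e.g. obtained by translating a non-English question).
def rank_facts(query_terms, facts):
    """Rank (subject, relation, object) triples by overlap with query terms."""
    terms = {t.lower() for t in query_terms}

    def score(triple):
        words = " ".join(triple).lower().split()
        return sum(1 for w in words if w in terms)

    return sorted(facts, key=score, reverse=True)
```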
arXiv Detail & Related papers (2025-03-20T13:25:03Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - A Benchmark for Long-Form Medical Question Answering [4.815957808858573]
There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA).
Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions.
In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors.
arXiv Detail & Related papers (2024-11-14T22:54:38Z) - MedConceptsQA: Open Source Medical Concepts QA Benchmark [0.07083082555458872]
We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering.
The benchmark comprises questions about various medical concepts across different vocabularies: diagnoses, procedures, and drugs.
We conducted evaluations of the benchmark using various Large Language Models.
arXiv Detail & Related papers (2024-05-12T17:54:50Z) - Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build a benchmark ClinicBench to better understand large language models (LLMs) in the clinic.
We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks.
We then construct six novel datasets and clinical tasks that are complex but common in real-world practice.
We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z) - BiMediX: Bilingual Medical Mixture of Experts LLM [90.3257333861513]
We introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English and Arabic. Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details. We propose a semi-automated English-to-Arabic translation pipeline with human refinement to ensure high-quality translations.
arXiv Detail & Related papers (2024-02-20T18:59:26Z) - ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for Chinese medical domain.
ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF.
We analyze possible biases through prompting ChiMed-GPT to perform attitude scales regarding discrimination of patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - Med-Flamingo: a Multimodal Medical Few-shot Learner [58.85676013818811]
We propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain.
Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks.
We conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app.
arXiv Detail & Related papers (2023-07-27T20:36:02Z) - Large Language Models Encode Clinical Knowledge [21.630872464930587]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
arXiv Detail & Related papers (2022-12-26T14:28:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.