PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain
- URL: http://arxiv.org/abs/2506.00250v2
- Date: Tue, 03 Jun 2025 00:22:37 GMT
- Authors: Mohammad Javad Ranjbar Kalahroodi, Amirhossein Sheikholselami, Sepehr Karimi, Sepideh Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
- Abstract summary: Large Language Models (LLMs) have achieved remarkable performance on a wide range of NLP benchmarks, often surpassing human-level accuracy. In this work, we introduce PersianMedQA, a large-scale, expert-validated dataset of multiple-choice Persian medical questions. We benchmark over 40 state-of-the-art models, including general-purpose, Persian fine-tuned, and medical LLMs, in zero-shot and chain-of-thought settings.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable performance on a wide range of NLP benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine, particularly in low-resource languages, remains underexplored. In this work, we introduce PersianMedQA, a large-scale, expert-validated dataset of multiple-choice Persian medical questions, designed to evaluate LLMs across both Persian and English. We benchmark over 40 state-of-the-art models, including general-purpose, Persian fine-tuned, and medical LLMs, in zero-shot and chain-of-thought (CoT) settings. Our results show that closed-source general models (e.g., GPT-4.1) consistently outperform all other categories, achieving 83.3% accuracy in Persian and 80.7% in English, while Persian fine-tuned models such as Dorna underperform significantly (e.g., 35.9% in Persian), often struggling with both instruction-following and domain reasoning. We also analyze the impact of translation, showing that while English performance is generally higher, Persian responses are sometimes more accurate due to cultural and clinical contextual cues. Finally, we demonstrate that model size alone is insufficient for robust performance without strong domain or language adaptation. PersianMedQA provides a foundation for evaluating multilingual and culturally grounded medical reasoning in LLMs. The PersianMedQA dataset can be accessed at: https://huggingface.co/datasets/MohammadJRanjbar/PersianMedQA
Related papers
- PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language [0.1747623282473278]
PerMedCQA is the first Persian-language benchmark for evaluating large language models on medical consumer question answering. We evaluate several state-of-the-art multilingual and instruction-tuned LLMs, utilizing MedJudge, a novel evaluation framework driven by an LLM grader. Our results highlight key challenges in multilingual medical QA and provide valuable insights for developing more accurate and context-aware medical assistance systems.
arXiv Detail & Related papers (2025-05-23T19:39:01Z) - FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models [0.5221124918965586]
This paper introduces the FarsEval-PKBETS benchmark, a subset of the FarsEval project for evaluating large language models in Persian. The benchmark consists of 4,000 questions and answers in various formats, including multiple-choice, short-answer, and descriptive responses. It covers a wide range of domains and tasks, including medicine, law, religion, Persian language, encyclopedic knowledge, human preferences, social knowledge, ethics and bias, text generation, and respecting others' rights.
arXiv Detail & Related papers (2025-04-20T17:43:47Z) - PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian [19.816050739495573]
PerCul is a dataset designed to assess the sensitivity of LLMs toward Persian culture. PerCul features story-based, multiple-choice questions that capture culturally nuanced scenarios. We evaluate several state-of-the-art multilingual and Persian-specific LLMs.
arXiv Detail & Related papers (2025-02-11T11:07:44Z) - A Comprehensive Evaluation of Large Language Models on Mental Illnesses in Arabic Context [0.9074663948713616]
Mental health disorders pose a growing public health concern in the Arab world. This study comprehensively evaluates 8 large language models (LLMs) on diverse mental health datasets.
arXiv Detail & Related papers (2025-01-12T16:17:25Z) - Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT [4.574416868427695]
This paper explores the efficacy of large language models (LLMs) for Persian.
We present the first comprehensive benchmarking study of LLMs across diverse Persian language tasks.
arXiv Detail & Related papers (2024-04-03T02:12:29Z) - BiMediX: Bilingual Medical Mixture of Experts LLM [90.3257333861513]
We introduce BiMediX, the first bilingual medical mixture-of-experts LLM designed for seamless interaction in both English and Arabic. Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details. We propose a semi-automated English-to-Arabic translation pipeline with human refinement to ensure high-quality translations.
arXiv Detail & Related papers (2024-02-20T18:59:26Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medical applications, termed PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z) - Holistic Evaluation of Language Models [183.94891340168175]
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood.
We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models.
arXiv Detail & Related papers (2022-11-16T18:51:34Z) - Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We demonstrate the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)