Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks
- URL: http://arxiv.org/abs/2508.15797v1
- Date: Wed, 13 Aug 2025 10:41:17 GMT
- Title: Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks
- Authors: Nouar AlDahoul, Yasir Zaki
- Abstract summary: This research examines the degree to which state-of-the-art large language models demonstrate and articulate healthcare knowledge in Arabic. We benchmark several LLMs using a medical dataset proposed in the Arabic NLP AraHealthQA challenge in the MedArabiQ2025 track. Our results reveal significant variations in correct-answer prediction accuracy and low variation in the semantic alignment of generated answers.
- Score: 1.3521447196536418
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in large language models (LLMs) has showcased impressive proficiency in numerous Arabic natural language processing (NLP) applications. Nevertheless, their effectiveness in Arabic medical NLP domains has received limited investigation. This research examines the degree to which state-of-the-art LLMs demonstrate and articulate healthcare knowledge in Arabic, assessing their capabilities across a varied array of Arabic medical tasks. We benchmark several LLMs using a medical dataset proposed in the Arabic NLP AraHealthQA challenge in the MedArabiQ2025 track. Various base LLMs were assessed on their ability to select the correct answer from the given choices in multiple-choice questions (MCQs) and fill-in-the-blank scenarios. Additionally, we evaluated the capacity of LLMs to answer open-ended questions in alignment with expert answers. Our results reveal significant variations in correct-answer prediction accuracy and low variation in the semantic alignment of generated answers, highlighting both the potential and limitations of current LLMs in Arabic clinical contexts. Our analysis shows that for the MCQ task, the proposed majority-voting solution, leveraging three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms the others, achieving up to 77% accuracy and securing first place overall in the AraHealthQA 2025 shared task, track 2 (sub-task 1). Moreover, for the open-ended questions task, several LLMs demonstrated excellent performance in terms of semantic alignment, achieving a maximum BERTScore of 86.44%.
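The majority-voting ensemble described in the abstract can be sketched as follows. This is a minimal illustration of the general technique, not the paper's actual pipeline; the model names and per-question answers below are hypothetical placeholders:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among model predictions.

    answers: a list of MCQ choices (e.g. ["A", "B", "A"]), one per model.
    Ties are broken by first-seen order, since Counter preserves
    insertion order for equal counts (Python 3.7+).
    """
    return Counter(answers).most_common(1)[0][0]

# Hypothetical example: three base models answer one MCQ.
predictions = {
    "gemini-flash-2.5": "B",
    "gemini-pro-2.5": "B",
    "gpt-o3": "C",
}
print(majority_vote(list(predictions.values())))  # → B
```

With three voters and single-letter MCQ answers, at least two models must agree for the ensemble to overturn any one model's error, which is why such voting can outperform each base model individually.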
Related papers
- DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z) - CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text [3.9845507207125967]
This paper presents a competitive approach to multilingual subjectivity detection using large language models (LLMs) with few-shot prompting. We show that LLMs, when paired with carefully designed prompts, can match or outperform fine-tuned smaller language models (SLMs). Our system achieved top rankings across multiple languages in the CheckThat! 2025 subjectivity detection task.
arXiv Detail & Related papers (2025-07-10T08:35:05Z) - Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z) - MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks [8.379270814399431]
This study introduces MedArabiQ, a novel benchmark dataset consisting of seven Arabic medical tasks. We first constructed the dataset using past medical exams and publicly available datasets. We then introduced different modifications to evaluate various LLM capabilities, including bias mitigation.
arXiv Detail & Related papers (2025-05-06T11:07:26Z) - PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z) - MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [86.7047714187813]
MMLU-ProX is a benchmark covering 29 languages, built on an English benchmark. Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons. To meet efficient evaluation needs, we provide a lite version containing 658 questions per language.
arXiv Detail & Related papers (2025-03-13T15:59:20Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs [3.1894617416005855]
Large language models (LLMs) present a promising solution to automate various ophthalmology procedures. LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks. This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages.
arXiv Detail & Related papers (2024-12-18T20:18:03Z) - D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models [5.439020425819001]
Large language models (LLMs) have garnered significant attention and widespread usage due to their impressive performance in various tasks.
However, they are not without their own set of challenges, including issues such as hallucinations, factual inconsistencies, and limitations in numerical-quantitative reasoning.
arXiv Detail & Related papers (2024-05-07T10:11:14Z) - From Multiple-Choice to Extractive QA: A Case Study for English and Arabic [51.13706104333848]
We explore the feasibility of repurposing an existing multilingual dataset for a new NLP task. We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic. We aim to help others adapt our approach for the remaining 120 BELEBELE language variants, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z) - MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering [8.110978727364397]
Large Language Models (LLMs) have the potential to facilitate the development of Artificial Intelligence technology.
This paper presents MedExpQA, the first multilingual benchmark based on medical exams to evaluate LLMs in Medical Question Answering.
arXiv Detail & Related papers (2024-04-08T15:03:57Z)