Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks
- URL: http://arxiv.org/abs/2602.05374v1
- Date: Thu, 05 Feb 2026 06:52:46 GMT
- Title: Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks
- Authors: Chaimae Abouzahir, Congbo Ma, Nizar Habash, Farah E. Shamout
- Abstract summary: Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood.
- Score: 12.886024273517556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that intensifies with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis suggests that model-reported confidence and explanations exhibit limited correlation with correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.
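To make the tokenization finding concrete, the sketch below measures token fertility (tokens per word) on a parallel Arabic/English sentence pair. The tokenizer choice and example sentences are illustrative assumptions, not the paper's actual setup; any LLM tokenizer can be substituted for GPT-2's.

```python
# Minimal sketch: comparing tokenizer fragmentation (tokens per word)
# on parallel Arabic/English medical text. GPT-2's tokenizer and the
# example sentences are illustrative, not taken from the paper.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

parallel = {
    "en": "The patient presents with acute myocardial infarction.",
    "ar": "يعاني المريض من احتشاء عضلة القلب الحاد.",
}

for lang, text in parallel.items():
    words = text.split()
    tokens = tokenizer.tokenize(text)
    fertility = len(tokens) / len(words)  # higher => more fragmentation
    print(f"{lang}: {len(words)} words -> {len(tokens)} tokens "
          f"(fertility {fertility:.2f})")
```

English-centric byte-level tokenizers typically shatter Arabic words into many byte fragments, which illustrates the kind of structural fragmentation the abstract describes; a similar per-item loop could correlate model-reported confidence with correctness to probe the reliability finding.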
Related papers
- When Meanings Meet: Investigating the Emergence and Quality of Shared Concept Spaces during Multilingual Language Model Training [57.230355403478995]
We investigate the development of language-agnostic concept spaces during pretraining of EuroLLM. We find that shared concept spaces emerge early and continue to refine, but that alignment with them is language-dependent. In contrast to prior work, our fine-grained manual analysis reveals that some apparent gains in translation quality reflect shifts in behavior.
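One simple way to probe concept-space alignment is to embed parallel sentences and compare their representations. The sketch below does this with a public multilingual encoder; the model name is an arbitrary assumption and this illustrates the general idea, not EuroLLM's actual analysis.

```python
# Minimal alignment probe: mean-pool hidden states for a parallel
# sentence pair and measure cosine similarity. High similarity suggests
# the two languages map into a shared concept space.
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

def embed(text: str) -> torch.Tensor:
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (1, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)    # mean pooling

en = embed("The heart pumps blood through the body.")
ar = embed("يضخ القلب الدم عبر الجسم.")
print(f"cross-lingual similarity: {torch.cosine_similarity(en, ar).item():.3f}")
```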
arXiv Detail & Related papers (2026-01-30T11:23:01Z)
- Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models [56.61984030508691]
We present the first mechanistic interpretability study of language confusion. We show that confusion points (CPs) are central to this phenomenon, and that editing a small set of critical neurons, identified via comparative analysis with a multilingual-tuned counterpart, substantially mitigates confusion.
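The following schematic sketch shows the shape of such a comparative neuron-editing procedure: rank neurons by how much their activations diverge between the two models, then zero the top offenders. It uses random stand-in data; the paper's actual layer choice, selection criterion, and edit may differ.

```python
# Schematic neuron-editing sketch with synthetic activations.
import numpy as np

rng = np.random.default_rng(0)
d_hidden, n_samples = 1024, 256

# Stand-ins for activations collected on the same multilingual prompts
# from an English-centric model and its multilingual-tuned counterpart.
acts_base = rng.normal(size=(n_samples, d_hidden))
acts_multi = rng.normal(size=(n_samples, d_hidden))

# Rank neurons by mean absolute activation difference.
divergence = np.abs(acts_base - acts_multi).mean(axis=0)
critical = np.argsort(divergence)[-32:]  # top-32 most divergent neurons

def edit(hidden_state: np.ndarray) -> np.ndarray:
    """Zero out the critical neurons in a forward-pass hidden state."""
    edited = hidden_state.copy()
    edited[..., critical] = 0.0
    return edited

print("neurons selected for editing:", critical[:5], "...")
```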
arXiv Detail & Related papers (2025-05-22T11:29:17Z)
- When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners [111.50503126693444]
We show that language-specific ablation consistently boosts multilingual reasoning performance. Compared to post-training, our training-free ablation achieves comparable or superior results with minimal computational overhead.
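One common way to realize training-free ablation, sketched below, is to estimate a "language direction" from mean activation differences and project it out of hidden states at inference time. This is an assumption about the general technique, not necessarily the paper's exact method.

```python
# Subspace-projection sketch of training-free language ablation,
# using synthetic stand-in activations.
import numpy as np

def language_direction(h_a: np.ndarray, h_b: np.ndarray) -> np.ndarray:
    """Unit vector separating mean activations of two languages."""
    d = h_a.mean(axis=0) - h_b.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden state along `direction`."""
    return hidden - np.outer(hidden @ direction, direction)

rng = np.random.default_rng(1)
h_en = rng.normal(0.5, 1.0, size=(128, 512))   # stand-in activations
h_ar = rng.normal(-0.5, 1.0, size=(128, 512))
v = language_direction(h_en, h_ar)
print("max component along v after ablation:",
      float(np.abs(ablate(h_en, v) @ v).max()))  # ~0
```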
arXiv Detail & Related papers (2025-05-21T08:35:05Z)
- Bridging Language Barriers in Healthcare: A Study on Arabic LLMs [1.2006896500048552]
This paper investigates the challenges of developing large language models proficient in both multilingual understanding and medical knowledge. We find that larger models with carefully calibrated language ratios achieve superior performance on native-language clinical tasks.
arXiv Detail & Related papers (2025-01-16T20:24:56Z)
- Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs [3.1894617416005855]
Large language models (LLMs) present a promising solution to automate various ophthalmology procedures. LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks. This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages.
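Because the questions are parallel, a per-language accuracy comparison isolates the effect of language from that of content. The sketch below shows that evaluation loop; `ask_model` is a hypothetical stand-in for whatever LLM API is under test.

```python
# Sketch: scoring one model on parallel questions, grouped by language.
from collections import defaultdict

def ask_model(question: str, options: list[str], lang: str) -> int:
    """Hypothetical call returning the index of the chosen option."""
    raise NotImplementedError("plug in the LLM under evaluation")

def evaluate(items: list[dict]) -> dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:  # item: {"lang", "question", "options", "answer"}
        pred = ask_model(item["question"], item["options"], item["lang"])
        correct[item["lang"]] += int(pred == item["answer"])
        total[item["lang"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# A resulting table like {"en": 0.82, "ar": 0.71} makes the
# language-driven gap directly visible.
```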
arXiv Detail & Related papers (2024-12-18T20:18:03Z)
- Polish-English medical knowledge transfer: A new benchmark and results [0.6804079979762627]
This study introduces a novel benchmark dataset based on Polish medical licensing and specialization exams. It comprises over 24,000 exam questions, including a subset of parallel Polish-English corpora. We evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and Polish-specific models, and compare their performance against human medical students.
arXiv Detail & Related papers (2024-11-30T19:02:34Z)
- Building Multilingual Datasets for Predicting Mental Health Severity through LLMs: Prospects and Challenges [3.0382033111760585]
Large Language Models (LLMs) are increasingly being integrated into various medical fields, including mental health support systems. We present a novel multilingual adaptation of widely-used mental health datasets, translated from English into six languages. This dataset enables a comprehensive evaluation of LLM performance in detecting mental health conditions and assessing their severity across multiple languages.
arXiv Detail & Related papers (2024-09-25T22:14:34Z)
- The Role of Language Models in Modern Healthcare: A Comprehensive Review [2.048226951354646]
The application of large language models (LLMs) in healthcare has gained significant attention.
This review examines the trajectory of language models from their early stages to the current state-of-the-art LLMs.
arXiv Detail & Related papers (2024-09-25T12:15:15Z)
- Evaluating Large Language Models for Radiology Natural Language Processing [68.98847776913381]
The rise of large language models (LLMs) has marked a pivotal shift in the field of natural language processing (NLP).
This study seeks to bridge this gap by critically evaluating thirty-two LLMs in interpreting radiology reports.
arXiv Detail & Related papers (2023-07-25T17:57:18Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.