CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare
- URL: http://arxiv.org/abs/2512.11437v1
- Date: Fri, 12 Dec 2025 10:19:27 GMT
- Title: CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare
- Authors: Akash Ghosh, Srivarshinee Sridhar, Raghav Kaushik Ravi, Muhsin Muhsin, Sriparna Saha, Chirag Agarwal
- Abstract summary: We present CLINIC, a Comprehensive Benchmark to evaluate the trustworthiness of language models in healthcare. Our evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks.
- Score: 25.074475493111162
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Integrating language models (LMs) in healthcare systems holds great promise for improving medical workflows and decision-making. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained in high-resource languages, making them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages, posing significant challenges for deploying them in global healthcare contexts where linguistic diversity is key. In this work, we present CLINIC, a Comprehensive Multilingual Benchmark to evaluate the trustworthiness of language models in healthcare. CLINIC systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all the major continents), and encompassing a wide array of critical healthcare topics like disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, CLINIC lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.
Related papers
- Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks [12.886024273517556]
Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood.
arXiv Detail & Related papers (2026-02-05T06:52:46Z)
- Toward Global Large Language Models in Medicine [67.38063166560406]
GlobMed is a large multilingual medical dataset containing over 500,000 entries spanning 12 languages, including four low-resource languages. GlobMed-Bench assesses 56 state-of-the-art proprietary and open-weight LLMs across multiple multilingual medical tasks, revealing significant performance disparities across languages. GlobMed-LLMs achieved an average performance improvement of over 40% relative to baseline models, with a more than threefold increase in performance on low-resource languages.
arXiv Detail & Related papers (2026-01-05T15:05:49Z)
- JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models [47.20100799532625]
We introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating medical safety of Large Language Models. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability.
arXiv Detail & Related papers (2026-01-04T18:18:18Z)
- Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models [50.34755385896279]
Confidence calibration is crucial for the reliable deployment of Large Language Models (LLMs). We conduct the first large-scale, systematic studies of multilingual calibration across six model families and over 100 languages. We find that non-English languages suffer from systematically worse calibration.
arXiv Detail & Related papers (2025-10-03T16:07:15Z)
- Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings [48.096652370210016]
We introduce a safety evaluation protocol tailored to the medical domain from both the patient user and clinician user perspectives. This is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming taking three different points of view.
arXiv Detail & Related papers (2025-07-09T19:38:58Z)
- MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety [56.77103365251923]
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data is often limited. We introduce a multilingual guardrail with reasoning for prompt classification.
arXiv Detail & Related papers (2025-04-21T17:15:06Z)
- Bridging Language Barriers in Healthcare: A Study on Arabic LLMs [1.2006896500048552]
This paper investigates the challenges of developing large language models proficient in both multilingual understanding and medical knowledge. We find that larger models with carefully calibrated language ratios achieve superior performance on native-language clinical tasks.
arXiv Detail & Related papers (2025-01-16T20:24:56Z)
- Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs [3.1894617416005855]
Large language models (LLMs) present a promising solution to automate various ophthalmology procedures. LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks. This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages.
arXiv Detail & Related papers (2024-12-18T20:18:03Z)
- Building Multilingual Datasets for Predicting Mental Health Severity through LLMs: Prospects and Challenges [3.0382033111760585]
Large Language Models (LLMs) are increasingly being integrated into various medical fields, including mental health support systems. We present a novel multilingual adaptation of widely-used mental health datasets, translated from English into six languages. This dataset enables a comprehensive evaluation of LLM performance in detecting mental health conditions and assessing their severity across multiple languages.
arXiv Detail & Related papers (2024-09-25T22:14:34Z)
- CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models [92.04812189642418]
We introduce CARES and aim to evaluate the Trustworthiness of Med-LVLMs across the medical domain.
We assess the trustworthiness of Med-LVLMs across five dimensions, including trustfulness, fairness, safety, privacy, and robustness.
arXiv Detail & Related papers (2024-06-10T04:07:09Z)
- Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries [31.82249599013959]
Large language models (LLMs) are transforming the ways the general public accesses and consumes information.
LLMs demonstrate impressive language understanding and generation proficiencies, but concerns regarding their safety remain paramount.
It remains unclear how these LLMs perform in the context of non-English languages.
arXiv Detail & Related papers (2023-10-19T20:02:40Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource and real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.