MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations
- URL: http://arxiv.org/abs/2602.05692v1
- Date: Thu, 05 Feb 2026 14:18:20 GMT
- Title: MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations
- Authors: Congbo Ma, Yichun Zhang, Yousef Al-Jazzazi, Ahamed Foisal, Laasya Sharma, Yousra Sadqi, Khaled Saleh, Jihad Mallat, Farah E. Shamout
- Abstract summary: We introduce MedErrBench, the first multilingual benchmark for error detection, localization, and correction. Based on an expanded taxonomy of ten common error types, MedErrBench covers English, Arabic and Chinese. Results reveal notable performance gaps, particularly in non-English settings.
- Score: 4.451052650309736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inaccuracies in existing or generated clinical text may lead to serious adverse consequences, especially when they involve a misdiagnosis or an incorrect treatment suggestion. With Large Language Models (LLMs) increasingly being used across diverse healthcare applications, comprehensive evaluation through dedicated benchmarks is crucial. However, such datasets remain scarce, especially across diverse languages and contexts. In this paper, we introduce MedErrBench, the first multilingual benchmark for error detection, localization, and correction, developed under the guidance of experienced clinicians. Based on an expanded taxonomy of ten common error types, MedErrBench covers English, Arabic and Chinese, with natural clinical cases annotated and reviewed by domain experts. We assessed the performance of a range of general-purpose, language-specific, and medical-domain language models across all three tasks. Our results reveal notable performance gaps, particularly in non-English settings, highlighting the need for clinically grounded, language-aware systems. By making MedErrBench and our evaluation protocols publicly available, we aim to advance multilingual clinical NLP to promote safer and more equitable AI-based healthcare globally. The dataset is available in the supplementary material. An anonymized version of the dataset is available at: https://github.com/congboma/MedErrBench.
Related papers
- BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text [14.409097921305134]
BRIDGE is a comprehensive benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. It covers eight major task types spanning the entire continuum of patient care across six clinical stages and 20 representative applications. Our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties.
arXiv Detail & Related papers (2025-04-28T04:13:18Z)
- Building Multilingual Datasets for Predicting Mental Health Severity through LLMs: Prospects and Challenges [3.0382033111760585]
Large Language Models (LLMs) are increasingly being integrated into various medical fields, including mental health support systems. We present a novel multilingual adaptation of widely-used mental health datasets, translated from English into six languages. This dataset enables a comprehensive evaluation of LLM performance in detecting mental health conditions and assessing their severity across multiple languages.
arXiv Detail & Related papers (2024-09-25T22:14:34Z)
- Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG). MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z)
- Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias [38.26934474189853]
Unifying Cross-Lingual Medical Vision-Language Pre-Training (Med-UniC) is designed to integrate multimodal medical data from English and Spanish.
Med-UniC reaches superior performance across 5 medical image tasks and 10 datasets encompassing over 30 diseases.
arXiv Detail & Related papers (2023-05-31T14:28:19Z)
- Cross-lingual Argument Mining in the Medical Domain [6.0158981171030685]
We show how to perform Argument Mining (AM) in medical texts for which no annotated data is available.
Our work shows that automatically translating and projecting annotations (data-transfer) from English to a given target language is an effective way to generate annotated data.
We also show how the automatically generated data in Spanish can be used to improve results in the original English monolingual setting.
arXiv Detail & Related papers (2023-01-25T11:21:12Z)
- Cross-Lingual Knowledge Transfer for Clinical Phenotyping [55.92262310716537]
We investigate cross-lingual knowledge transfer strategies to perform clinical phenotyping for clinics that do not use the English language.
We evaluate these strategies for a Greek and a Spanish clinic leveraging clinical notes from different clinical domains.
Our results show that using multilingual data overall improves clinical phenotyping models and can compensate for data sparseness.
arXiv Detail & Related papers (2022-08-03T08:33:21Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
- Labeling of Multilingual Breast MRI Reports [1.8374319565577157]
We present a framework for developing a multilingual breast MRI report classifier using a custom-built language representation called LAMBR.
Our proposed method overcomes practical challenges faced in clinical settings, and we demonstrate improved performance in extracting labels from medical reports.
arXiv Detail & Related papers (2020-07-06T19:22:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.