MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts
- URL: http://arxiv.org/abs/2511.00421v1
- Date: Sat, 01 Nov 2025 06:19:34 GMT
- Title: MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts
- Authors: Naoto Iwase, Hiroki Okuyama, Junichiro Iwasawa
- Abstract summary: Large language models (LLMs) show increasing promise in medical applications, but their ability to detect and correct errors in clinical texts remains under-evaluated. We introduce MedRECT, a cross-lingual benchmark (Japanese/English) that formulates medical error handling as three subtasks. We evaluate 9 contemporary LLMs spanning proprietary, open-weight, and reasoning families.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) show increasing promise in medical applications, but their ability to detect and correct errors in clinical texts -- a prerequisite for safe deployment -- remains under-evaluated, particularly beyond English. We introduce MedRECT, a cross-lingual benchmark (Japanese/English) that formulates medical error handling as three subtasks: error detection, error localization (sentence extraction), and error correction. MedRECT is built with a scalable, automated pipeline from the Japanese Medical Licensing Examinations (JMLE) and a curated English counterpart, yielding MedRECT-ja (663 texts) and MedRECT-en (458 texts) with comparable error/no-error balance. We evaluate 9 contemporary LLMs spanning proprietary, open-weight, and reasoning families. Key findings: (i) reasoning models substantially outperform standard architectures, with up to 13.5% relative improvement in error detection and 51.0% in sentence extraction; (ii) cross-lingual evaluation reveals 5-10% performance gaps from English to Japanese, with smaller disparities for reasoning models; (iii) targeted LoRA fine-tuning yields asymmetric improvements in error correction performance (Japanese: +0.078, English: +0.168) while preserving reasoning capabilities; and (iv) our fine-tuned model exceeds human expert performance on structured medical error correction tasks. To our knowledge, MedRECT is the first comprehensive cross-lingual benchmark for medical error correction, providing a reproducible framework and resources for developing safer medical LLMs across languages.
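The abstract's three-subtask formulation (detection, sentence extraction, correction) maps naturally onto a simple item schema and scoring loop. The sketch below is an illustration only, not the authors' implementation: the field names, the per-subtask scoring, and the use of exact string match as a stand-in for the paper's (unspecified here) correction metric are all assumptions.

```python
# A minimal sketch of a MedRECT-style item and per-subtask scoring.
# Field names and metrics are assumptions for illustration; the paper's
# actual correction score (e.g., the reported +0.078 / +0.168 deltas)
# is likely a softer text-similarity measure, not exact match.
from dataclasses import dataclass
from typing import Optional


@dataclass
class MedRECTItem:
    text: str                          # full clinical text
    has_error: bool                    # subtask 1: error detection label
    error_sentence: Optional[str]      # subtask 2: gold sentence (None if no error)
    corrected_sentence: Optional[str]  # subtask 3: gold correction (None if no error)


def score_item(item: MedRECTItem,
               pred_has_error: bool,
               pred_sentence: Optional[str],
               pred_correction: Optional[str]) -> dict:
    """Score one item on all three subtasks."""
    detection = pred_has_error == item.has_error
    # Extraction and correction only apply to items that contain an error.
    extraction = (item.has_error
                  and pred_sentence is not None
                  and pred_sentence.strip() == (item.error_sentence or "").strip())
    correction = (item.has_error
                  and pred_correction is not None
                  and pred_correction.strip() == (item.corrected_sentence or "").strip())
    return {"detection": detection, "extraction": extraction, "correction": correction}
```

Aggregating these per-item dicts over MedRECT-ja (663 texts) and MedRECT-en (458 texts) would yield the kind of per-language, per-subtask comparison the abstract reports.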
Related papers
- Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
We show the importance of prompt optimisation for small and large language models when applied to the task of error detection. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection accuracy over the baseline.
arXiv Detail & Related papers (2026-02-25T23:46:49Z) - MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts
We propose MedREK, a retrieval-based editing framework that integrates a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Our results on various medical benchmarks demonstrate that MedREK achieves superior performance across different core metrics.
arXiv Detail & Related papers (2025-10-15T12:50:33Z) - SwasthLLM: a Unified Cross-Lingual, Multi-Task, and Meta-Learning Zero-Shot Framework for Medical Diagnosis Using Contrastive Representations
SwasthLLM is a unified, zero-shot, cross-lingual, and multi-task learning framework for medical diagnosis. It operates effectively across English, Hindi, and Bengali without requiring language-specific fine-tuning. SwasthLLM achieves high diagnostic performance, with a test accuracy of 97.22% and an F1-score of 97.17% in supervised settings.
arXiv Detail & Related papers (2025-09-24T21:20:49Z) - From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations [45.414878840652115]
Large language models (LLMs) have demonstrated promising performance on medical benchmarks.<n>However, their ability to perform medical calculations remains underexplored and poorly evaluated.<n>In this work, we revisit medical calculation evaluation with a stronger focus on clinical trustworthiness.
arXiv Detail & Related papers (2025-09-20T09:10:26Z) - ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning [54.30630356786752]
ReasonMed is the largest medical reasoning dataset to date, with 370k high-quality examples.<n>It is built through a multi-agent generation, verification, and refinement process.<n>Using ReasonMed, we find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results.
arXiv Detail & Related papers (2025-06-11T08:36:55Z) - LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation [58.25892575437433]
evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error.<n>We present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios.
arXiv Detail & Related papers (2025-06-04T15:43:14Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.<n>We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.<n>Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes [22.401540975926324]
We introduce MEDEC, the first publicly available benchmark for medical error detection and correction in clinical notes.<n> MEDEC consists of 3,848 clinical texts, including 488 clinical notes from three US hospital systems.<n>We evaluate recent LLMs for the tasks of detecting and correcting medical errors requiring both medical knowledge and reasoning capabilities.
arXiv Detail & Related papers (2024-12-26T15:54:10Z) - WangLab at MEDIQA-CORR 2024: Optimized LLM-based Programs for Medical Error Detection and Correction [5.7931394318054155]
We present our approach that achieved top performance in all three subtasks.
For the MS dataset, which contains subtle errors, we developed a retrieval-based system.
For the UW dataset, reflecting more realistic clinical notes, we created a pipeline of modules to detect, localize, and correct errors.
arXiv Detail & Related papers (2024-04-22T19:31:45Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; these experiments show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)