MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts
- URL: http://arxiv.org/abs/2511.00421v1
- Date: Sat, 01 Nov 2025 06:19:34 GMT
- Title: MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts
- Authors: Naoto Iwase, Hiroki Okuyama, Junichiro Iwasawa
- Abstract summary: Large language models (LLMs) show increasing promise in medical applications, but their ability to detect and correct errors in clinical texts remains under-evaluated. We introduce MedRECT, a cross-lingual benchmark (Japanese/English) that formulates medical error handling as three subtasks. We evaluate 9 contemporary LLMs spanning proprietary, open-weight, and reasoning families.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) show increasing promise in medical applications, but their ability to detect and correct errors in clinical texts -- a prerequisite for safe deployment -- remains under-evaluated, particularly beyond English. We introduce MedRECT, a cross-lingual benchmark (Japanese/English) that formulates medical error handling as three subtasks: error detection, error localization (sentence extraction), and error correction. MedRECT is built with a scalable, automated pipeline from the Japanese Medical Licensing Examinations (JMLE) and a curated English counterpart, yielding MedRECT-ja (663 texts) and MedRECT-en (458 texts) with comparable error/no-error balance. We evaluate 9 contemporary LLMs spanning proprietary, open-weight, and reasoning families. Key findings: (i) reasoning models substantially outperform standard architectures, with up to 13.5% relative improvement in error detection and 51.0% in sentence extraction; (ii) cross-lingual evaluation reveals 5-10% performance gaps from English to Japanese, with smaller disparities for reasoning models; (iii) targeted LoRA fine-tuning yields asymmetric improvements in error correction performance (Japanese: +0.078, English: +0.168) while preserving reasoning capabilities; and (iv) our fine-tuned model exceeds human expert performance on structured medical error correction tasks. To our knowledge, MedRECT is the first comprehensive cross-lingual benchmark for medical error correction, providing a reproducible framework and resources for developing safer medical LLMs across languages.
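The abstract's three-subtask formulation (detection, sentence extraction, correction) maps naturally onto a simple item schema and scoring loop. The sketch below is an illustration only, not the authors' implementation: the field names, the per-subtask scoring, and the use of exact string match as a stand-in for the paper's (unspecified here) correction metric are all assumptions.

```python
# A minimal sketch of a MedRECT-style item and per-subtask scoring.
# Field names and metrics are assumptions for illustration; the paper's
# actual correction score (e.g., the reported +0.078 / +0.168 deltas)
# is likely a softer text-similarity measure, not exact match.
from dataclasses import dataclass
from typing import Optional


@dataclass
class MedRECTItem:
    text: str                          # full clinical text
    has_error: bool                    # subtask 1: error detection label
    error_sentence: Optional[str]      # subtask 2: gold sentence (None if no error)
    corrected_sentence: Optional[str]  # subtask 3: gold correction (None if no error)


def score_item(item: MedRECTItem,
               pred_has_error: bool,
               pred_sentence: Optional[str],
               pred_correction: Optional[str]) -> dict:
    """Score one item on all three subtasks."""
    detection = pred_has_error == item.has_error
    # Extraction and correction only apply to items that contain an error.
    extraction = (item.has_error
                  and pred_sentence is not None
                  and pred_sentence.strip() == (item.error_sentence or "").strip())
    correction = (item.has_error
                  and pred_correction is not None
                  and pred_correction.strip() == (item.corrected_sentence or "").strip())
    return {"detection": detection, "extraction": extraction, "correction": correction}
```

Aggregating these per-item dicts over MedRECT-ja (663 texts) and MedRECT-en (458 texts) would yield the kind of per-language, per-subtask comparison the abstract reports.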
Related papers
- Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
We show the importance of prompt optimisation for small and large language models when applied to the task of error detection. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection accuracy over the baseline.
arXiv Detail & Related papers (2026-02-25T23:46:49Z) - MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts
We propose MedREK, a retrieval-based editing framework that integrates a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Our results on various medical benchmarks demonstrate that MedREK achieves superior performance across different core metrics.
arXiv Detail & Related papers (2025-10-15T12:50:33Z) - SwasthLLM: a Unified Cross-Lingual, Multi-Task, and Meta-Learning Zero-Shot Framework for Medical Diagnosis Using Contrastive Representations
SwasthLLM is a unified, zero-shot, cross-lingual, and multi-task learning framework for medical diagnosis. It operates effectively across English, Hindi, and Bengali without requiring language-specific fine-tuning. SwasthLLM achieves high diagnostic performance, with a test accuracy of 97.22% and an F1-score of 97.17% in supervised settings.
arXiv Detail & Related papers (2025-09-24T21:20:49Z) - From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations [45.414878840652115]
Large language models (LLMs) have demonstrated promising performance on medical benchmarks.<n>However, their ability to perform medical calculations remains underexplored and poorly evaluated.<n>In this work, we revisit medical calculation evaluation with a stronger focus on clinical trustworthiness.
arXiv Detail & Related papers (2025-09-20T09:10:26Z) - ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning [54.30630356786752]
ReasonMed is the largest medical reasoning dataset to date, with 370k high-quality examples.<n>It is built through a multi-agent generation, verification, and refinement process.<n>Using ReasonMed, we find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results.
arXiv Detail & Related papers (2025-06-11T08:36:55Z) - LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation [58.25892575437433]
evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error.<n>We present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios.
arXiv Detail & Related papers (2025-06-04T15:43:14Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.<n>We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.<n>Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes [22.401540975926324]
We introduce MEDEC, the first publicly available benchmark for medical error detection and correction in clinical notes.<n> MEDEC consists of 3,848 clinical texts, including 488 clinical notes from three US hospital systems.<n>We evaluate recent LLMs for the tasks of detecting and correcting medical errors requiring both medical knowledge and reasoning capabilities.
arXiv Detail & Related papers (2024-12-26T15:54:10Z) - WangLab at MEDIQA-CORR 2024: Optimized LLM-based Programs for Medical Error Detection and Correction [5.7931394318054155]
We present our approach that achieved top performance in all three subtasks.
For the MS dataset, which contains subtle errors, we developed a retrieval-based system.
For the UW dataset, reflecting more realistic clinical notes, we created a pipeline of modules to detect, localize, and correct errors.
arXiv Detail & Related papers (2024-04-22T19:31:45Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; these experiments show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)