Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
- URL: http://arxiv.org/abs/2602.22483v1
- Date: Wed, 25 Feb 2026 23:46:49 GMT
- Title: Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
- Authors: Craig Myles, Patrick Schrempf, David Harris-Birtill
- Abstract summary: We show the importance of prompt optimisation for small and large language models when applied to the task of error detection. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error-detection accuracy over the baseline.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Errors in medical text can cause delays or even result in incorrect treatment for patients. Recently, language models have shown promise in their ability to automatically detect errors in medical text, an ability that could significantly benefit healthcare systems. In this paper, we explore the importance of prompt optimisation for small and large language models when applied to the task of error detection. We perform rigorous experiments and analysis across frontier and open-source language models. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error-detection accuracy over the baseline, from 0.669 to 0.785 with GPT-5 and from 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical doctors and achieving state-of-the-art performance on the MEDEC benchmark dataset. Code available on GitHub: https://github.com/CraigMyles/clinical-note-error-detection
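The listing above gives only the headline numbers, so the following is a minimal, hypothetical sketch of the genetic-Pareto idea behind GEPA applied to note-level error detection. Nothing here is taken from the paper's repository: `call_model`, `propose_mutation`, the seed prompt, and the Pareto bookkeeping are all illustrative assumptions about how such a loop is typically structured.

```python
# Hypothetical GEPA-style (genetic-Pareto) prompt-optimisation loop for
# binary error detection in clinical notes. All model calls are stubbed.
import random

def call_model(prompt: str, note: str) -> str:
    """Placeholder for an LLM call (e.g. GPT-5 or Qwen3-32B behind an API)."""
    raise NotImplementedError("wire up a model client here")

def propose_mutation(prompt: str, failures: list[str]) -> str:
    """Placeholder reflection step: an LLM rewrites the prompt using
    the validation notes the parent prompt got wrong."""
    raise NotImplementedError("wire up a reflection model here")

def score(prompt: str, example: dict) -> float:
    """1.0 if the model's error/no-error verdict matches the gold label."""
    return float(call_model(prompt, example["note"]).strip().lower()
                 == example["label"])

def pareto_front(pool: list[str], valset: list[dict]) -> list[str]:
    """Keep every prompt that ties the best score on at least one
    validation example -- per-instance winners rather than a single
    aggregate, which is the core of genetic-Pareto selection."""
    scores = {p: [score(p, ex) for ex in valset] for p in pool}
    return [p for p in pool
            if any(scores[p][i] >= max(scores[q][i] for q in pool)
                   for i in range(len(valset)))]

def gepa_loop(seed_prompt: str, valset: list[dict], budget: int = 20) -> str:
    pool = [seed_prompt]
    for _ in range(budget):
        parent = random.choice(pareto_front(pool, valset))
        failures = [ex["note"] for ex in valset
                    if score(parent, ex) == 0.0]  # cache scores in a real run
        pool.append(propose_mutation(parent, failures))
    # Return the candidate with the best mean validation accuracy.
    return max(pool, key=lambda p: sum(score(p, ex) for ex in valset))
```

In practice this loop is run through an optimiser library rather than hand-rolled, and the reflection step is what separates GEPA from naive random mutation: the proposing model sees execution feedback, not just the prompt text.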
Related papers
- CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation [46.0800756149113]
CURE is an error-aware curriculum learning framework for medical vision-language models. It fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%.
arXiv Detail & Related papers (2026-01-21T19:19:41Z) - A DeepSeek-Powered AI System for Automated Chest Radiograph Interpretation in Clinical Practice [83.11942224668127]
Janus-Pro-CXR (1B) is a chest X-ray interpretation system based on the DeepSeek Janus-Pro model. Our system outperforms state-of-the-art X-ray report generation models in automated report generation.
arXiv Detail & Related papers (2025-12-23T13:26:13Z) - A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction [8.312687115594512]
We evaluate zero-shot prompting, static prompting with random exemplars, and retrieval-augmented dynamic prompting. We measure performance using accuracy, recall, false-positive rate (FPR), and an aggregate of ROUGE-1, BLEURT, and BERTScore for error correction. (A minimal sketch of the dynamic-prompting idea appears after this list.)
arXiv Detail & Related papers (2025-11-25T02:40:49Z) - MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts [0.0]
Large language models (LLMs) show increasing promise in medical applications, but their ability to detect and correct errors in clinical texts remains under-evaluated. We introduce MedRECT, a cross-lingual benchmark (Japanese/English) that formulates medical error handling as three subtasks. We evaluate 9 contemporary LLMs spanning proprietary, open-weight, and reasoning families.
arXiv Detail & Related papers (2025-11-01T06:19:34Z) - SwasthLLM: a Unified Cross-Lingual, Multi-Task, and Meta-Learning Zero-Shot Framework for Medical Diagnosis Using Contrastive Representations [0.4077787659104315]
SwasthLLM is a unified, zero-shot, cross-lingual, and multi-task learning framework for medical diagnosis. It operates effectively across English, Hindi, and Bengali without requiring language-specific fine-tuning. SwasthLLM achieves high diagnostic performance, with a test accuracy of 97.22% and an F1-score of 97.17% in supervised settings.
arXiv Detail & Related papers (2025-09-24T21:20:49Z) - Towards Automated Error Discovery: A Study in Conversational AI [48.735443116662026]
We introduce Automated Error Discovery, a framework for detecting and defining errors in conversational AI. We also propose SEEED (Soft Clustering Extended-Based Error Detection), an encoder-based approach to its implementation.
arXiv Detail & Related papers (2025-09-13T14:53:22Z) - Arabic Large Language Models for Medical Text Generation [0.5483130283061118]
This study proposes an approach that fine-tunes large language models (LLMs) for Arabic medical text generation. The system is designed to assist patients by providing accurate medical advice, diagnoses, drug recommendations, and treatment plans based on user input.
arXiv Detail & Related papers (2025-09-12T09:37:26Z) - OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities [54.152681077418805]
Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalizations of model capabilities. We propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting.
arXiv Detail & Related papers (2025-05-29T05:25:27Z) - Leveraging Language Models for Automated Patient Record Linkage [0.5461938536945723]
This study investigates the feasibility of leveraging language models for automated patient record linkage. We utilize real-world healthcare data from the Missouri Cancer Registry and Research Center.
arXiv Detail & Related papers (2025-04-21T17:41:15Z) - Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z) - Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z) - Collaborative Boundary-aware Context Encoding Networks for Error Map Prediction [65.44752447868626]
We propose collaborative boundary-aware context encoding networks, called AEP-Net, for the error map prediction task.
Specifically, we propose a collaborative feature transformation branch for better feature fusion between images and masks, and precise localization of error regions.
The AEP-Net achieves average DSCs of 0.8358 and 0.8164 on the error prediction task, and shows a high Pearson correlation coefficient of 0.9873.
arXiv Detail & Related papers (2020-06-25T12:42:01Z)
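As noted in the RAG-enabled dynamic prompting entry above, the idea is to build each prompt at query time from retrieved, labelled exemplars rather than a fixed or random set. Below is a minimal, hypothetical sketch of that mechanism; the TF-IDF retriever, prompt wording, and exemplar format are illustrative assumptions, not that paper's implementation.

```python
# Hypothetical retrieval-augmented dynamic prompting: retrieve the k most
# similar labelled notes and splice them into the prompt as demonstrations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_dynamic_prompt(note: str, exemplars: list[dict], k: int = 3) -> str:
    corpus = [ex["note"] for ex in exemplars]
    vec = TfidfVectorizer().fit(corpus + [note])
    sims = cosine_similarity(vec.transform([note]), vec.transform(corpus))[0]
    top = sorted(range(len(corpus)), key=lambda i: sims[i], reverse=True)[:k]
    demos = "\n\n".join(
        f"Note: {exemplars[i]['note']}\nVerdict: {exemplars[i]['label']}"
        for i in top
    )
    return ("Decide whether the clinical note contains a medical error.\n\n"
            f"{demos}\n\nNote: {note}\nVerdict:")
```

A dense retriever over clinical embeddings would normally replace TF-IDF here; the structural point is only that the demonstrations vary per input.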
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.