A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction
- URL: http://arxiv.org/abs/2511.19858v2
- Date: Wed, 26 Nov 2025 09:29:37 GMT
- Title: A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction
- Authors: Farzad Ahmed, Joniel Augustine Jerome, Meliha Yetisgen, Özlem Uzuner
- Abstract summary: We evaluate zero-shot prompting, static prompting with random exemplars, and retrieval-augmented dynamic prompting. We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate of ROUGE-1, BLEURT, and BERTScore for error correction.
- Score: 8.312687115594512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction. Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning. Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections. Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.
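The retrieval-augmented dynamic prompting (RDP) strategy described in the abstract selects exemplars similar to the input note rather than random ones. The sketch below is a minimal illustration of that idea, not the paper's implementation: the bag-of-words cosine retrieval, the exemplar field names, and the choice of k are all assumptions made for the example.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_exemplars(query_note, exemplar_pool, k=3):
    """Return the k pool exemplars most similar to the query note."""
    q = Counter(query_note.lower().split())
    scored = [(cosine_similarity(q, Counter(ex["note"].lower().split())), ex)
              for ex in exemplar_pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:k]]

def build_rdp_prompt(query_note, exemplar_pool, k=3):
    """Assemble a dynamic prompt: retrieved exemplars, then the note to check."""
    exemplars = retrieve_exemplars(query_note, exemplar_pool, k)
    parts = ["Detect and correct any medical error in the final note.\n"]
    for ex in exemplars:
        parts.append(f"Note: {ex['note']}\nError: {ex['label']}\n")
    parts.append(f"Note: {query_note}\nError:")
    return "\n".join(parts)
```

In a real RDP pipeline the similarity function would typically be a dense-embedding retriever over the exemplar corpus, but the prompt-assembly structure (retrieved in-context examples followed by the query) is the same.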
Related papers
- Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models [0.0]
We show the importance of prompt optimisation for small and large language models when applied to the task of error detection. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection accuracy over the baseline.
arXiv Detail & Related papers (2026-02-25T23:46:49Z) - Toward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation [3.952186976672079]
We show that lightweight interventions, including prompt engineering and small-scale fine-tuning, can improve accuracy without the computational overhead of search-based methods. To address hierarchical near-miss errors, we introduce clinical code verification as both a standalone task and a pipeline component.
arXiv Detail & Related papers (2025-10-08T23:50:58Z) - Towards Automated Error Discovery: A Study in Conversational AI [48.735443116662026]
We introduce Automated Error Discovery, a framework for detecting and defining errors in conversational AI. We also propose SEEED (Soft Clustering Extended-Based Error Detection), an encoder-based approach to its implementation.
arXiv Detail & Related papers (2025-09-13T14:53:22Z) - Hide and Seek with LLMs: An Adversarial Game for Sneaky Error Generation and Self-Improving Diagnosis [51.88592148135258]
We propose Hide and Seek Game (HSG), a dynamic adversarial framework for error generation and diagnosis. HSG involves two adversarial roles: Sneaky, which "hides" by generating subtle, deceptive reasoning errors, and Diagnosis, which "seeks" to accurately detect them. Experiments on several math reasoning tasks show that HSG significantly boosts error diagnosis, achieving 16.8%-31.4% higher accuracy than baselines like GPT-4o.
arXiv Detail & Related papers (2025-08-05T12:45:21Z) - Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction [4.304383298057423]
We propose the Reliable Correction Framework (RLLM-CF), which consists of three stages: error pre-detection, iterative correction of chain-of-thought sub-tasks, and reasoning process verification. Experiments on AISHELL-1, AISHELL-2, and Librispeech show that the GPT-4o model enhanced by our framework achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER.
arXiv Detail & Related papers (2025-05-30T08:40:49Z) - Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs [87.79350168490475]
This work formally defines self-consistent errors and evaluates mainstream detection methods on them. All four types of detection methods significantly struggle to detect self-consistent errors. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe.
arXiv Detail & Related papers (2025-05-23T09:18:56Z) - SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z) - Not All Errors Are Equal: Investigation of Speech Recognition Errors in Alzheimer's Disease Detection [62.942077348224046]
Speech recognition plays an important role in automatic detection of Alzheimer's disease (AD). Recent studies have revealed a non-linear relationship between word error rates (WER) and AD detection performance. This work presents a series of analyses to explore the effect of ASR transcription errors in BERT-based AD detection systems.
arXiv Detail & Related papers (2024-12-09T09:32:20Z) - A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction [79.52464132360618]
Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task.
We introduce a novel approach based on error detector-corrector framework.
Our detector is designed to yield two error detection results, each characterized by high precision and recall.
arXiv Detail & Related papers (2024-09-06T09:26:45Z) - WangLab at MEDIQA-CORR 2024: Optimized LLM-based Programs for Medical Error Detection and Correction [5.7931394318054155]
We present our approach that achieved top performance in all three subtasks.
For the MS dataset, which contains subtle errors, we developed a retrieval-based system.
For the UW dataset, reflecting more realistic clinical notes, we created a pipeline of modules to detect, localize, and correct errors.
arXiv Detail & Related papers (2024-04-22T19:31:45Z) - Word-level confidence estimation for RNN transducers [7.12355127219356]
We present a lightweight neural confidence model tailored for Automatic Speech Recognition (ASR) systems with Recurrent Neural Network Transducers (RNN-T).
Compared to other existing approaches, our model utilizes: (a) the time information associated with recognized words, which reduces the computational complexity, and (b) a simple and elegant trick for mapping between sub-word and word sequences.
arXiv Detail & Related papers (2021-09-28T18:38:00Z) - Improving Distinction between ASR Errors and Speech Disfluencies with Feature Space Interpolation [0.0]
Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing.
This paper proposes a scheme to improve existing LM-based ASR error detection systems.
arXiv Detail & Related papers (2021-08-04T02:11:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information (including all summaries) and is not responsible for any consequences arising from its use.