WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
- URL: http://arxiv.org/abs/2511.16544v2
- Date: Fri, 21 Nov 2025 15:29:43 GMT
- Title: WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
- Authors: Zachary Ellis, Jared Joselowitz, Yash Deo, Yajie He, Anna Kalygina, Aisling Higham, Mana Rahimzadeh, Yan Jia, Ibrahim Habli, Ernest Lim
- Abstract summary: Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, yet standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors.
- Score: 3.468314243424983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA through DSPy to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's $\kappa$ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
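The abstract's first analysis, correlating per-utterance WER with clinician-assigned risk labels, can be sketched as a small rank-correlation check. This is a minimal illustration assuming the `jiwer` package for WER and SciPy for Spearman correlation; the utterances and labels below are invented, not drawn from the paper's datasets.

```python
# Minimal sketch of the evaluation-gap analysis: per-utterance WER vs.
# ordinal clinician risk labels. All example data below are invented.
import jiwer                      # pip install jiwer
from scipy.stats import spearmanr

RISK = {"No Impact": 0, "Minimal Impact": 1, "Significant Impact": 2}

# (ground-truth utterance, ASR hypothesis, clinician label)
samples = [
    ("take fifty milligrams daily",  "take fifteen milligrams daily", "Significant Impact"),
    ("um i feel a bit better today", "i feel a bit better today",     "No Impact"),
    ("the pain is in my left eye",   "the pain is in my left i",      "Minimal Impact"),
]

wers  = [jiwer.wer(ref, hyp) for ref, hyp, _ in samples]
risks = [RISK[label] for *_, label in samples]

# A weak rho here mirrors the paper's finding that WER is a poor proxy
# for clinical impact: small edits can be dangerous, large ones benign.
rho, p = spearmanr(wers, risks)
print(f"Spearman rho(WER, risk) = {rho:.2f} (p = {p:.2f})")
```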
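The judge itself is a DSPy program whose prompt is evolved by GEPA to reproduce the clinicians' three-way impact labels. Below is a minimal sketch of such a setup, not the authors' released configuration: the signature fields, the toy training example, and the exact-match metric are illustrative assumptions, and GEPA's reflection model is simply set to the same Gemini-2.5-Pro the paper uses as judge.

```python
# Minimal sketch of the LLM-as-a-Judge: a DSPy signature optimized by GEPA
# to replicate clinician impact labels. Field names and data are assumed.
import dspy
from sklearn.metrics import cohen_kappa_score

LABELS = ("No Impact", "Minimal Impact", "Significant Impact")

class ClinicalImpact(dspy.Signature):
    """Rate the clinical impact of discrepancies between a ground-truth
    utterance and its ASR transcript."""
    reference: str = dspy.InputField(desc="ground-truth utterance")
    hypothesis: str = dspy.InputField(desc="ASR-generated transcript")
    impact: str = dspy.OutputField(desc=f"one of: {', '.join(LABELS)}")

def impact_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # GEPA reflects on textual feedback, so return a score plus a hint.
    score = float(gold.impact == pred.impact)
    return dspy.Prediction(score=score,
                           feedback=f"expected '{gold.impact}', got '{pred.impact}'")

dspy.configure(lm=dspy.LM("gemini/gemini-2.5-pro"))  # judge model from the paper
judge = dspy.ChainOfThought(ClinicalImpact)

# Clinician-labelled triples would populate these splits; one toy example here.
trainset = [
    dspy.Example(reference="Take fifty milligrams daily.",
                 hypothesis="Take fifteen milligrams daily.",
                 impact="Significant Impact").with_inputs("reference", "hypothesis"),
]

optimizer = dspy.GEPA(metric=impact_metric, auto="light",
                      reflection_lm=dspy.LM("gemini/gemini-2.5-pro"))
optimized_judge = optimizer.compile(judge, trainset=trainset, valset=trainset)

# Agreement with clinician labels; the abstract reports kappa = 0.816 on a
# real held-out split (toy data here will not produce a meaningful score).
gold = [ex.impact for ex in trainset]
pred = [optimized_judge(reference=ex.reference, hypothesis=ex.hypothesis).impact
        for ex in trainset]
print("Cohen's kappa:", cohen_kappa_score(gold, pred, labels=list(LABELS)))
```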
Related papers
- AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning [73.50200033931148]
We introduce AgentsEval, a multi-agent stream reasoning framework that emulates the collaborative diagnostic workflow of radiologists. By dividing the evaluation process into interpretable steps, including criteria definition, evidence extraction, alignment, and consistency scoring, AgentsEval provides explicit reasoning traces and structured clinical feedback. Experimental results demonstrate that AgentsEval delivers clinically aligned, semantically faithful, and interpretable evaluations that remain robust under paraphrastic, semantic, and stylistic perturbations.
arXiv Detail & Related papers (2026-01-23T11:59:13Z) - Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems [19.880569341968023]
Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. We propose a retrieval-augmented multi-agent framework designed to automate the generation of instance-specific evaluation rubrics.
arXiv Detail & Related papers (2026-01-21T16:40:41Z) - Responsible Evaluation of AI for Mental Health [72.85175110624736]
Current approaches to evaluating AI tools in mental health care are fragmented and poorly aligned with clinical practice, social context, and first-hand user experience. This paper argues for a rethinking of responsible evaluation by introducing an interdisciplinary framework that integrates clinical soundness, social context, and equity.
arXiv Detail & Related papers (2026-01-20T12:55:10Z) - ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning [58.01333341218153]
We propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Our method generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs.
arXiv Detail & Related papers (2025-12-29T12:58:58Z) - Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs). Our dataset consists of 1,762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z) - A Fully Automatic Framework for Intracranial Pressure Grading: Integrating Keyframe Identification, ONSD Measurement and Clinical Data [3.6652537579778106]
Intracranial pressure (ICP) elevation poses severe threats to cerebral function, thus necessitating monitoring for timely intervention. We introduce a fully automatic two-stage framework for ICP grading, integrating ONSD measurement and clinical data. Our method achieves a validation accuracy of $0.845 \pm 0.071$ and an independent test accuracy of 0.786, significantly outperforming a conventional threshold-based method.
arXiv Detail & Related papers (2025-09-11T11:37:48Z) - OpenAI's HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries [2.2807344448218507]
We evaluate our agentic, RAG-based clinical support assistant, DR.INFO, using HealthBench. On the Hard subset of 1,000 challenging examples, DR.INFO achieves a HealthBench score of 0.51. In a separate 100-sample evaluation against similar agentic RAG assistants, it maintains a performance lead with a HealthBench score of 0.54.
arXiv Detail & Related papers (2025-08-29T09:51:41Z) - Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation [32.410641778559544]
ICARE (Interpretable and Clinically-grounded Agent-based Report Evaluation) is an interpretable evaluation framework. Two agents, each given either the ground-truth or the generated report, generate clinically meaningful questions and quiz each other. By linking scores to question-answer pairs, ICARE enables transparent and interpretable assessment.
arXiv Detail & Related papers (2025-08-04T18:28:03Z) - GEMA-Score: Granular Explainable Multi-Agent Scoring Framework for Radiology Report Evaluation [7.838068874909676]
Granular Explainable Multi-Agent Score (GEMA-Score) conducts both objective and subjective evaluation through a large language model-based multi-agent workflow. GEMA-Score achieves the highest correlation with human expert evaluations on a public dataset.
arXiv Detail & Related papers (2025-03-07T11:42:22Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations [74.83732294523402]
We introduce a novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards. We also explore dialogue-based fine-tuning, which transforms static datasets into conversational formats to better capture iterative reasoning processes. Experiments show that dialogue-tuned models outperform traditional methods, with improvements of $9.64\%$ in multi-round reasoning scenarios and $6.18\%$ in accuracy in a noisy environment.
arXiv Detail & Related papers (2025-01-29T18:58:48Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)