Beyond Accuracy: Risk-Sensitive Evaluation of Hallucinated Medical Advice
- URL: http://arxiv.org/abs/2602.07319v1
- Date: Sat, 07 Feb 2026 02:25:44 GMT
- Title: Beyond Accuracy: Risk-Sensitive Evaluation of Hallucinated Medical Advice
- Authors: Savan Doshi
- Abstract summary: We propose a risk-sensitive evaluation framework that quantifies hallucinations through the presence of risk-bearing language. We apply this framework to three instruction-tuned language models using controlled patient-facing prompts designed as safety stress tests.
- Score: 0.1609950046042424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models are increasingly being used in patient-facing medical question answering, where hallucinated outputs can vary widely in potential harm. However, existing hallucination standards and evaluation metrics focus primarily on factual correctness, treating all errors as equally severe. This obscures clinically relevant failure modes, particularly when models generate unsupported but actionable medical language. We propose a risk-sensitive evaluation framework that quantifies hallucinations through the presence of risk-bearing language, including treatment directives, contraindications, urgency cues, and mentions of high-risk medications. Rather than assessing clinical correctness, our approach evaluates the potential impact of hallucinated content if acted upon. We further combine risk scoring with a relevance measure to identify high-risk, low-grounding failures. We apply this framework to three instruction-tuned language models using controlled patient-facing prompts designed as safety stress tests. Our results show that models with similar surface-level behavior exhibit substantially different risk profiles and that standard evaluation metrics fail to capture these distinctions. These findings highlight the importance of incorporating risk sensitivity into hallucination evaluation and suggest that evaluation validity is critically dependent on task and prompt design.
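The scoring idea in the abstract can be made concrete with a minimal sketch. Everything below is an illustrative assumption, not the authors' implementation: the lexicons, category weights, thresholds, and the token-overlap relevance proxy are placeholders standing in for whatever the paper actually uses.

```python
# Minimal sketch of risk-sensitive hallucination scoring. The lexicons,
# weights, thresholds, and the token-overlap relevance proxy are all
# illustrative assumptions, not the paper's implementation.
import re

# Hypothetical lexicons for the four risk-bearing language categories.
RISK_LEXICONS = {
    "treatment_directive":  (2.0, [r"\btake\b", r"\bdose\b", r"\bstart taking\b"]),
    "contraindication":     (2.0, [r"\bdo not take\b", r"\bavoid\b", r"\bcontraindicated\b"]),
    "urgency_cue":          (1.5, [r"\bimmediately\b", r"\bemergency\b", r"\bcall 911\b"]),
    "high_risk_medication": (3.0, [r"\bwarfarin\b", r"\binsulin\b", r"\bopioid\b"]),
}

def risk_score(answer: str) -> float:
    """Weighted count of risk-bearing phrases in a model answer."""
    text = answer.lower()
    return sum(weight * len(re.findall(pattern, text))
               for weight, patterns in RISK_LEXICONS.values()
               for pattern in patterns)

def relevance(answer: str, prompt: str) -> float:
    """Crude grounding proxy: fraction of answer tokens shared with the prompt."""
    a, p = set(answer.lower().split()), set(prompt.lower().split())
    return len(a & p) / max(len(a), 1)

def high_risk_low_grounding(answer: str, prompt: str,
                            risk_thr: float = 2.0, rel_thr: float = 0.2) -> bool:
    """Flag the failure mode the paper targets: risky yet ungrounded output."""
    return risk_score(answer) >= risk_thr and relevance(answer, prompt) < rel_thr

# Toy stress-test prompt and a hallucinated, actionable answer.
prompt = "I have a mild headache, what should I do?"
answer = "Start taking warfarin immediately and double the dose."
print(risk_score(answer), round(relevance(answer, prompt), 2),
      high_risk_low_grounding(answer, prompt))
```

The point of combining the two scores is that an answer can be factually wrong yet harmless (high risk score never fires) or fluent and on-topic yet dangerously directive; only the joint high-risk, low-grounding condition captures the failure mode the paper highlights.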
Related papers
- Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models [48.95516224614331]
We introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical workflows, and implicit adherence to safety protocols.
arXiv Detail & Related papers (2026-01-11T02:20:40Z)
- The Persona Paradox: Medical Personas as Behavioral Priors in Clinical Language Models [18.902372087770562]
Our work shows that personas function as behavioral priors that introduce context-dependent trade-offs rather than guarantees of safety or expertise.
arXiv Detail & Related papers (2026-01-08T21:01:11Z)
- HACK: Hallucinations Along Certainty and Knowledge Axes [66.66625343090743]
We propose a framework for categorizing hallucinations along two axes: knowledge and certainty. We identify a particularly concerning subset of hallucinations where models hallucinate with certainty despite having the correct knowledge internally.
arXiv Detail & Related papers (2025-10-28T09:34:31Z)
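A toy reading of that two-axis taxonomy is sketched below; the quadrant labels and the way `knows`/`certain` would be estimated are my own placeholders, not the paper's method.

```python
# Toy sketch of a knowledge x certainty categorization; quadrant labels
# and probe signals are assumptions, not the HACK paper's implementation.
def hack_quadrant(knows: bool, certain: bool) -> str:
    """Place a hallucinated answer on the knowledge x certainty grid."""
    if knows and certain:
        return "certain despite correct internal knowledge"  # the concerning subset
    if knows:
        return "uncertain despite correct internal knowledge"
    if certain:
        return "certain without knowledge"
    return "uncertain without knowledge"

# In practice `knows` might be probed from hidden states and `certain`
# from agreement across resampled generations (both are assumptions here).
print(hack_quadrant(knows=True, certain=True))
```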
- Hallucination Benchmark for Speech Foundation Models [33.92968426403491]
Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. We introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic.
arXiv Detail & Related papers (2025-10-18T16:26:16Z)
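Axis-wise scoring in this spirit can be roughed out with crude difflib proxies; the benchmark's actual metrics (and any phoneme or morphology machinery) are not reproduced here, and `divergence` is a stand-in of my own.

```python
# Crude proxies for two of the four axes; the benchmark's real metrics
# and its phonetic/morphological machinery are not reproduced here.
from difflib import SequenceMatcher

def divergence(ref, hyp) -> float:
    """1 - similarity ratio over a pair of sequences (a rough proxy)."""
    return 1.0 - SequenceMatcher(None, ref, hyp).ratio()

def axis_scores(reference: str, hypothesis: str) -> dict:
    return {
        # Lexical axis: word-level divergence between reference and output.
        "lexical": divergence(reference.split(), hypothesis.split()),
        # Phonetic axis stand-in: character-level divergence (a proper
        # implementation would compare phoneme sequences instead).
        "phonetic_proxy": divergence(list(reference), list(hypothesis)),
    }

# A fluent but acoustically unrelated transcription scores high on both axes.
print(axis_scores("call the doctor now", "tall rock tore down"))
```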
- Beyond Accuracy: Rethinking Hallucination and Regulatory Response in Generative AI [7.068082004005692]
Hallucination in generative AI is often treated as a technical failure to produce factually correct output. This paper critically examines how regulatory and evaluation frameworks have inherited a narrow view of hallucination.
arXiv Detail & Related papers (2025-09-12T19:41:10Z)
- Competing Risks: Impact on Risk Estimation and Algorithmic Fairness [0.0]
Survival analysis accounts for patients who do not experience the event of interest during the study period, known as censored patients. Competing risks are often treated as censoring, a practice frequently overlooked due to a limited understanding of its consequences. Our work shows why treating competing risks as censoring introduces substantial bias in survival estimates, leading to systematic overestimation of risk and, critically, amplifying disparities.
arXiv Detail & Related papers (2025-08-07T14:25:43Z)
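The bias mechanism is easy to see in code. A minimal sketch with lifelines on fabricated data follows; the simulated dataset, event probabilities, and model choices are toy assumptions, not the paper's experiments, so treat the output as qualitative.

```python
# Sketch of the bias from lumping a competing event in with censoring,
# contrasted with a proper cumulative-incidence estimate. Data is toy.
import numpy as np
from lifelines import KaplanMeierFitter, AalenJohansenFitter

rng = np.random.default_rng(0)
t = rng.exponential(10, size=500)                        # follow-up times
e = rng.choice([0, 1, 2], size=500, p=[0.3, 0.4, 0.3])   # 0 censored, 1 event, 2 competing

# Naive analysis: the competing event is treated as censoring, which
# systematically overestimates the risk of event 1.
km = KaplanMeierFitter().fit(t, event_observed=(e == 1))
print("naive KM risk:", 1 - km.survival_function_.iloc[-1, 0])

# Competing-risks analysis: Aalen-Johansen cumulative incidence of event 1.
aj = AalenJohansenFitter().fit(t, e, event_of_interest=1)
print("AJ cumulative incidence (tail):")
print(aj.cumulative_density_.tail(1))
```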
- MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them [52.764019220214344]
Hallucinations pose critical risks for large language model (LLM)-based agents. We present MIRAGE-Bench, the first unified benchmark for eliciting and evaluating hallucinations in interactive environments.
arXiv Detail & Related papers (2025-07-28T17:38:29Z)
- Metrics that matter: Evaluating image quality metrics for medical image generation [48.85783422900129]
This study comprehensively assesses commonly used no-reference image quality metrics using brain MRI data. We evaluate metric sensitivity to a range of challenges, including noise, distribution shifts, and, critically, morphological alterations designed to mimic clinically relevant inaccuracies.
arXiv Detail & Related papers (2025-05-12T01:57:25Z)
- Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection [25.31502165275055]
Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models. We conduct a large-scale empirical evaluation of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods.
arXiv Detail & Related papers (2025-04-25T06:37:29Z)
- Medical Hallucinations in Foundation Models and Their Impact on Healthcare [71.15392179084428]
Hallucinations in foundation models arise from autoregressive training objectives. Top-performing models exceeded 97% accuracy when augmented with chain-of-thought prompting.
arXiv Detail & Related papers (2025-02-26T02:30:44Z)
- Towards Mitigating Hallucination in Large Language Models via Self-Reflection [63.2543947174318]
Large language models (LLMs) have shown promise for generative and knowledge-intensive tasks, including question answering (QA). This paper analyses the phenomenon of hallucination in medical generative QA systems using widely adopted LLMs and datasets.
arXiv Detail & Related papers (2023-10-10T03:05:44Z)
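The self-reflection recipe named in the title can be sketched schematically; `ask_llm` below is a placeholder for any chat-completion client, and the prompt templates and loop structure are my own rough approximations, not the authors' templates.

```python
# Schematic self-reflection loop in the spirit of the paper; `ask_llm`
# is a placeholder and the prompts are approximations, not the authors'.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def answer_with_reflection(question: str, max_rounds: int = 2) -> str:
    answer = ask_llm(f"Answer this medical question:\n{question}")
    for _ in range(max_rounds):
        # Ask the model to critique its own draft for unsupported claims.
        critique = ask_llm(
            "List factual problems or unsupported claims in this answer, "
            f"or reply OK.\nQ: {question}\nA: {answer}"
        )
        if critique.strip().upper() == "OK":
            break
        # Revise the draft in light of the critique.
        answer = ask_llm(
            f"Revise the answer to fix these problems.\nQ: {question}\n"
            f"A: {answer}\nProblems: {critique}"
        )
    return answer
```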
- Benchmarking Heterogeneous Treatment Effect Models through the Lens of Interpretability [82.29775890542967]
Estimating personalized effects of treatments is a complex, yet pervasive problem.
Recent developments in the machine learning literature on heterogeneous treatment effect estimation gave rise to many sophisticated, but opaque, tools.
We use post-hoc feature importance methods to identify features that influence the model's predictions.
arXiv Detail & Related papers (2022-06-16T17:59:05Z)
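The post-hoc feature-importance probe used by the last paper above can be sketched with standard scikit-learn permutation importance; the toy regression data and random-forest stand-in are assumptions here (the paper studies treatment-effect models, not this regressor).

```python
# Generic post-hoc feature-importance probe; the data and model are toy
# stand-ins, only the permutation-importance machinery is standard.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# Shuffle each feature and measure the drop in model performance.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: {imp:.3f}")
```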
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.