MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning
- URL: http://arxiv.org/abs/2510.12818v1
- Date: Thu, 09 Oct 2025 22:12:58 GMT
- Title: MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning
- Authors: Rajarshi Ghosh, Abhay Gupta, Hudson McBride, Anurag Vaidya, Faisal Mahmood,
- Abstract summary: We introduce MEDEQUALQA, a counterfactual benchmark that perturbs only patient pronouns while holding critical symptoms and conditions constant. We evaluate a GPT-4.1 model and compute Semantic Textual Similarity (STS) between reasoning traces to measure stability across pronoun variants. Our results show overall high similarity (mean STS >0.80), but reveal consistent localized divergences in cited risk factors, guideline anchors, and differential ordering.
- Score: 7.167933033102407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly deployed in clinical decision support, yet subtle demographic cues can influence their reasoning. Prior work has documented disparities in outputs across patient groups, but little is known about how internal reasoning shifts under controlled demographic changes. We introduce MEDEQUALQA, a counterfactual benchmark that perturbs only patient pronouns (he/him, she/her, they/them) while holding critical symptoms and conditions (CSCs) constant. Each clinical vignette is expanded into single-CSC ablations, producing three parallel datasets of approximately 23,000 items each (69,000 total). We evaluate a GPT-4.1 model and compute Semantic Textual Similarity (STS) between reasoning traces to measure stability across pronoun variants. Our results show overall high similarity (mean STS >0.80), but reveal consistent localized divergences in cited risk factors, guideline anchors, and differential ordering, even when final diagnoses remain unchanged. Our error analysis highlights certain cases in which the reasoning shifts, underscoring clinically relevant bias loci that may cascade into inequitable care. MEDEQUALQA offers a controlled diagnostic setting for auditing reasoning stability in medical AI.
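The stability check described in the abstract can be illustrated with a minimal sketch: score the similarity of reasoning traces produced for pronoun-swapped variants of the same vignette and flag pairs that fall below the paper's ~0.80 band. The paper computes STS with an embedding model; the bag-of-words cosine similarity below is only a crude, self-contained stand-in, and the example traces are hypothetical.

```python
# Crude STS proxy: cosine similarity over bag-of-words vectors.
# A real STS score would come from a sentence-embedding model.
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical reasoning traces for he/him vs. she/her variants of one vignette.
trace_he = "risk factors include smoking and hypertension order ecg first"
trace_she = "risk factors include smoking and anxiety order ecg first"

sts = cosine_similarity(trace_he, trace_she)
stable = sts > 0.80  # flag variant pairs falling below the ~0.80 band
```

Even when a pair clears the threshold, the paper's point is that localized divergences (here, "hypertension" vs. "anxiety" as a cited risk factor) can still matter clinically.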
Related papers
- Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering [94.37535002230504]
We develop a training-free, inference-time control framework termed Semantically Decoupled Latent Steering. Our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition. We show that our approach significantly reduces the probability of historical hallucinations.
arXiv Detail & Related papers (2026-02-27T04:49:01Z) - Modeling Expert AI Diagnostic Alignment via Immutable Inference Snapshots [1.0499611180329804]
The transition between initial model inference and expert correction is rarely analyzed as a structured signal. We introduce a diagnostic alignment framework in which the AI-generated image-based report is preserved as an immutable inference state.
arXiv Detail & Related papers (2026-02-26T13:11:58Z) - M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding [66.78251988482222]
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning. Current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. M3CoTBench aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare.
arXiv Detail & Related papers (2026-01-13T17:42:27Z) - Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models [48.95516224614331]
We introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical workflows, and implicit adherence to safety protocols.
arXiv Detail & Related papers (2026-01-11T02:20:40Z) - MedEinst: Benchmarking the Einstellung Effect in Medical LLMs through Counterfactual Differential Diagnosis [13.241795322837861]
We introduce MedEinst, a counterfactual benchmark with 5,383 paired clinical cases across 49 diseases. We measure susceptibility via the Bias Trap Rate, the probability of misdiagnosing trap cases despite correctly diagnosing their controls.
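The Bias Trap Rate described above reduces to a simple conditional fraction: among paired cases whose control variant is diagnosed correctly, the share whose trap variant is misdiagnosed. The sketch below assumes a (control_correct, trap_correct) boolean encoding per pair, which is illustrative and not the paper's data format.

```python
# Hedged sketch of a Bias Trap Rate computation over paired cases.
def bias_trap_rate(pairs):
    """pairs: iterable of (control_correct, trap_correct) booleans per case pair."""
    # Only pairs whose control was diagnosed correctly are eligible.
    eligible = [trap_ok for control_ok, trap_ok in pairs if control_ok]
    if not eligible:
        return 0.0
    # Fraction of eligible pairs whose trap variant was misdiagnosed.
    return sum(1 for trap_ok in eligible if not trap_ok) / len(eligible)

# Three pairs with correct controls; two of their traps are misdiagnosed.
pairs = [(True, False), (True, True), (False, False), (True, False)]
rate = bias_trap_rate(pairs)  # 2 of 3 eligible traps missed
```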
arXiv Detail & Related papers (2026-01-10T17:39:25Z) - Measuring Stability Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology [34.80893325510028]
Small open-source medical large language models (LLMs) offer promising opportunities for low-resource deployment and broader accessibility. We couple human evaluation with clinical review to assess six small open-source medical LLMs.
arXiv Detail & Related papers (2025-12-26T14:30:53Z) - MeCaMIL: Causality-Aware Multiple Instance Learning for Fair and Interpretable Whole Slide Image Diagnosis [40.3028468133626]
Multiple instance learning (MIL) has emerged as the dominant paradigm for whole slide image (WSI) analysis in computational pathology. MeCaMIL, a causality-aware MIL framework, explicitly models demographic confounders through structured causal graphs. MeCaMIL achieves superior fairness: demographic disparity variance drops by over 65% on average across attributes.
arXiv Detail & Related papers (2025-11-14T06:47:21Z) - Conformal Lesion Segmentation for 3D Medical Images [82.92159832699583]
We propose Conformal Lesion Segmentation (CLS), a risk-constrained framework that calibrates data-driven thresholds via conformalization to ensure the test-time FNR remains below a target tolerance. We validate the statistical soundness and predictive performance of CLS on six 3D-LS datasets across five backbone models, and conclude with actionable insights for deploying risk-aware segmentation in clinical practice.
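A split-conformal calibration step in the spirit of the framework above can be sketched briefly: choose the operating threshold as a finite-sample-corrected empirical quantile of per-case calibration scores, so the controlled risk stays below the tolerance alpha. The score definition and function names here are assumptions, not the paper's exact procedure.

```python
# Sketch of a split-conformal quantile threshold over calibration scores.
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Return the ceil((n + 1) * (1 - alpha)) / n empirical quantile of the scores."""
    scores = sorted(calibration_scores)
    n = len(scores)
    # Finite-sample correction: rank (n + 1)(1 - alpha) instead of n(1 - alpha).
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[rank - 1]

scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
tau = conformal_threshold(scores, alpha=0.1)
```

With only ten calibration cases and alpha = 0.1, the corrected quantile falls on the largest score, reflecting how conformal guarantees become conservative at small calibration sizes.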
arXiv Detail & Related papers (2025-10-19T08:21:00Z) - Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs). Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z) - Embeddings to Diagnosis: Latent Fragility under Agentic Perturbations in Clinical LLMs [0.0]
We propose a geometry-aware evaluation framework, LAPD (Latent Agentic Perturbation Diagnostics), which probes the latent robustness of clinical LLMs under structured adversarial edits. Within this framework, we introduce Latent Diagnosis Flip Rate (LDFR), a model-agnostic diagnostic signal that captures representational instability when embeddings cross decision boundaries in PCA-reduced latent space. Our results reveal a persistent gap between surface robustness and semantic stability, underscoring the importance of geometry-aware auditing in safety-critical clinical AI.
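A flip-rate diagnostic in the spirit of LDFR above can be reduced to a minimal sketch: the fraction of cases whose predicted diagnosis changes after a structured perturbation. The full method operates on decision-boundary crossings in PCA-reduced embedding space; plain label comparison with hypothetical diagnoses is used here as a self-contained stand-in.

```python
# Minimal flip-rate stand-in: compare predictions before/after perturbation.
def flip_rate(clean_labels, perturbed_labels):
    """Fraction of paired predictions that disagree after perturbation."""
    if len(clean_labels) != len(perturbed_labels):
        raise ValueError("label sequences must be the same length")
    if not clean_labels:
        return 0.0
    flips = sum(1 for c, p in zip(clean_labels, perturbed_labels) if c != p)
    return flips / len(clean_labels)

# Hypothetical diagnoses before and after an adversarial edit.
clean = ["pneumonia", "sepsis", "asthma", "sepsis"]
perturbed = ["pneumonia", "uti", "asthma", "pancreatitis"]
rate = flip_rate(clean, perturbed)  # 2 of 4 diagnoses flipped
```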
arXiv Detail & Related papers (2025-07-27T16:48:53Z) - DeVisE: Behavioral Testing of Medical Large Language Models [14.832083455439749]
DeVisE is a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants.
arXiv Detail & Related papers (2025-06-18T10:42:22Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Interpretability of Uncertainty: Exploring Cortical Lesion Segmentation in Multiple Sclerosis [33.91263917157504]
Uncertainty quantification (UQ) has become critical for evaluating the reliability of artificial intelligence systems.
This study addresses the interpretability of instance-wise uncertainty values in deep learning models for focal lesion segmentation in magnetic resonance imaging.
arXiv Detail & Related papers (2024-07-08T09:13:30Z) - SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy [45.2233252981348]
Large Language Models (LLMs) have been shown to encode clinical knowledge. We present SemioLLM, an evaluation framework that benchmarks 6 state-of-the-art models. We show that most LLMs are able to accurately and confidently generate probabilistic predictions of seizure onset zones in the brain.
arXiv Detail & Related papers (2024-07-03T11:02:12Z) - Structural-Based Uncertainty in Deep Learning Across Anatomical Scales: Analysis in White Matter Lesion Segmentation [8.64414399041931]
Uncertainty quantification (UQ) is an indicator of the trustworthiness of automated deep-learning (DL) tools in the context of white matter lesion (WML) segmentation.
We develop measures for quantifying uncertainty at lesion and patient scales, derived from structural prediction discrepancies.
The results from a multi-centric MRI dataset of 444 patients demonstrate that our proposed measures more effectively capture model errors at the lesion and patient scales.
arXiv Detail & Related papers (2023-11-15T13:04:57Z) - Towards Reliable Medical Image Segmentation by Modeling Evidential Calibrated Uncertainty [57.023423137202485]
Concerns regarding the reliability of medical image segmentation persist among clinicians. We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks. By leveraging subjective logic theory, we explicitly model probability and uncertainty for medical image segmentation.
arXiv Detail & Related papers (2023-01-01T05:02:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.