The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making
- URL: http://arxiv.org/abs/2506.17163v1
- Date: Fri, 20 Jun 2025 17:09:27 GMT
- Title: The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making
- Authors: Abinitha Gourabathina, Yuexing Hao, Walter Gerych, Marzyeh Ghassemi
- Abstract summary: We introduce MedPerturb, a dataset designed to evaluate medical Large Language Models (LLMs) under controlled perturbations of clinical input. With MedPerturb, we release a dataset of 800 clinical contexts grounded in realistic input variability. We use MedPerturb in two case studies to reveal how shifts in gender identity cues, language style, or format reflect diverging treatment selections between humans and LLMs.
- Score: 13.734312822024947
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clinical robustness is critical to the safe deployment of medical Large Language Models (LLMs), but key questions remain about how LLMs and humans may differ in response to the real-world variability typified by clinical settings. To address this, we introduce MedPerturb, a dataset designed to systematically evaluate medical LLMs under controlled perturbations of clinical input. MedPerturb consists of clinical vignettes spanning a range of pathologies, each transformed along three axes: (1) gender modifications (e.g., gender-swapping or gender-removal); (2) style variation (e.g., uncertain phrasing or colloquial tone); and (3) format changes (e.g., LLM-generated multi-turn conversations or summaries). With MedPerturb, we release a dataset of 800 clinical contexts grounded in realistic input variability, outputs from four LLMs, and three human expert reads per clinical context. We use MedPerturb in two case studies to reveal how shifts in gender identity cues, language style, or format reflect diverging treatment selections between humans and LLMs. We find that LLMs are more sensitive to gender and style perturbations while human annotators are more sensitive to LLM-generated format perturbations such as clinical summaries. Our results highlight the need for evaluation frameworks that go beyond static benchmarks to assess the similarity between human clinician and LLM decisions under the variability characteristic of clinical settings.
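To make the abstract's evaluation concrete, the sketch below shows one way such a perturbation analysis could be structured: for each perturbation axis, compare the treatment selection made on the original vignette with the selection made on its perturbed counterpart, separately for LLM and human readers, and report the fraction of contexts where the decision flips. The record layout, field names, and decision labels here are illustrative assumptions, not the released MedPerturb schema.

```python
# A minimal sketch of a per-axis "flip rate" analysis, assuming a
# hypothetical record layout. Field names and labels are illustrative;
# they are NOT the released MedPerturb schema.
from collections import defaultdict

# Each record pairs one reader's treatment selection on the original
# vignette with their selection on its perturbed counterpart.
records = [
    {"axis": "gender", "reader": "llm",   "original": "manage", "perturbed": "refer"},
    {"axis": "gender", "reader": "human", "original": "manage", "perturbed": "manage"},
    {"axis": "style",  "reader": "llm",   "original": "refer",  "perturbed": "manage"},
    {"axis": "format", "reader": "human", "original": "refer",  "perturbed": "manage"},
    {"axis": "format", "reader": "llm",   "original": "refer",  "perturbed": "refer"},
]

def flip_rates(rows):
    """Fraction of contexts whose treatment selection changes after
    perturbation, grouped by (perturbation axis, reader type)."""
    flips, totals = defaultdict(int), defaultdict(int)
    for r in rows:
        key = (r["axis"], r["reader"])
        totals[key] += 1
        flips[key] += r["original"] != r["perturbed"]  # bool counts as 0/1
    return {k: flips[k] / totals[k] for k in totals}

for (axis, reader), rate in sorted(flip_rates(records).items()):
    print(f"{axis:>6} | {reader:>5} | flip rate = {rate:.2f}")
```

Under this framing, the abstract's finding would show up as higher flip rates for LLM readers on the gender and style axes, and higher flip rates for human readers on the format axis.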
Related papers
- A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. Most medical LLMs are trained on data from a single institution, which limits generalizability and safety in heterogeneous systems. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z)
- ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning [58.01333341218153]
We propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Our method generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs.
arXiv Detail & Related papers (2025-12-29T12:58:58Z)
- Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs [37.6690828097719]
Large language models (LLMs) show promise for clinical use. Many medical datasets rely on simplified Question-Answering (QA) formats that underrepresent real-world clinical decision-making. We propose a unifying paradigm that characterizes clinical decision-making tasks along two dimensions: Clinical Backgrounds and Clinical Questions.
arXiv Detail & Related papers (2025-10-22T20:06:10Z)
- Gender Bias in Large Language Models for Healthcare: Assignment Consistency and Clinical Implications [16.066280458640676]
The integration of large language models into healthcare holds promise to enhance clinical decision-making. Gender has long influenced physician behaviors and patient outcomes. Some models even displayed a systematic female-male disparity in their interpretation of patient gender.
arXiv Detail & Related papers (2025-10-08T01:11:06Z)
- Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task [0.0]
Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks. This study examines whether LLMs can approximate individual differences in the phonemic fluency task.
arXiv Detail & Related papers (2025-05-22T03:08:27Z)
- Investigating LLMs in Clinical Triage: Promising Capabilities, Persistent Intersectional Biases [6.135648377533492]
Large Language Models (LLMs) have shown promise in clinical decision support, yet their application to triage remains underexplored. We systematically investigate the capabilities of LLMs in emergency department triage through two key dimensions. We assess multiple LLM-based approaches, ranging from continued pre-training to in-context learning, as well as machine learning approaches.
arXiv Detail & Related papers (2025-04-22T21:11:47Z)
- Medical large language models are easily distracted [0.8211696054238238]
Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance. We developed MedDistractQA, a benchmark using USMLE-style questions embedded with simulated real-world distractions.
arXiv Detail & Related papers (2025-04-01T21:34:01Z)
- Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare [38.0169924254127]
We find that gender information is highly localized in middle layers and can be reliably manipulated at inference time via patching. We find that the representation of patient race is somewhat more distributed, but can also be intervened upon, to a degree.
arXiv Detail & Related papers (2025-02-18T22:40:40Z)
- LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
- How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making? [2.7476176772825904]
This research investigates the evaluation and mitigation of bias in Large Language Models (LLMs).
We introduce a novel Counterfactual Patient Variations (CPV) dataset derived from the JAMA Clinical Challenge.
Using this dataset, we built a framework for bias evaluation, employing both Multiple Choice Questions (MCQs) and corresponding explanations.
arXiv Detail & Related papers (2024-10-21T23:14:10Z)
- Mitigating Hallucinations of Large Language Models in Medical Information Extraction via Contrastive Decoding [92.32881381717594]
We introduce ALternate Contrastive Decoding (ALCD) to solve hallucination issues in medical information extraction tasks.
ALCD demonstrates significant improvements in resolving hallucination issues compared to conventional decoding methods.
arXiv Detail & Related papers (2024-10-21T07:19:19Z)
- Hate Personified: Investigating the role of LLMs in content moderation [64.26243779985393]
For subjective tasks such as hate detection, where people perceive hate differently, the Large Language Model's (LLM) ability to represent diverse groups is unclear.
By including additional context in prompts, we analyze LLM's sensitivity to geographical priming, persona attributes, and numerical information to assess how well the needs of various groups are reflected.
arXiv Detail & Related papers (2024-10-03T16:43:17Z)
- Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLMs) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z)
- AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments [2.567146936147657]
We introduce AgentClinic, a multimodal agent benchmark for evaluating large language models (LLMs) in simulated clinical environments. We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy.
arXiv Detail & Related papers (2024-05-13T17:38:53Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
- Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z)