Related papers: DeVisE: Behavioral Testing of Medical Large Language Models

URL: http://arxiv.org/abs/2506.15339v1
Date: Wed, 18 Jun 2025 10:42:22 GMT
Title: DeVisE: Behavioral Testing of Medical Large Language Models
Authors: Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto
Abstract summary: DeVisE is a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants.
Score: 14.832083455439749
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly used in clinical decision support, yet current evaluation methods often fail to distinguish genuine medical reasoning from superficial patterns. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV, generating both raw (real-world) and template-based (synthetic) versions with controlled single-variable counterfactuals targeting demographic (age, gender, ethnicity) and vital sign attributes. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants, under both zero-shot and fine-tuned settings. We assess model behavior via (1) input-level sensitivity - how counterfactuals alter the likelihood of a note; and (2) downstream reasoning - how they affect predicted hospital length-of-stay. Our results show that zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models tend to be more stable yet less responsive to clinically meaningful changes. Notably, demographic factors subtly but consistently influence outputs, emphasizing the importance of fairness-aware evaluation. This work highlights the utility of behavioral testing in exposing the reasoning strategies of clinical LLMs and informing the design of safer, more transparent medical AI systems.
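The controlled single-variable counterfactuals described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Patient` fields, the note template, and the attribute values below are all hypothetical and are not drawn from MIMIC-IV. The sketch only shows the general idea of changing one attribute at a time in a template-based note while holding everything else fixed.

```python
from dataclasses import asdict, dataclass, replace
from typing import Dict, List, Sequence, Tuple

# Hypothetical patient record; the field names are assumptions for illustration.
@dataclass(frozen=True)
class Patient:
    age: int
    gender: str
    ethnicity: str
    heart_rate: int

# Hypothetical synthetic note template (not the template used in the paper).
TEMPLATE = (
    "{age}-year-old {ethnicity} {gender} admitted to the ICU. "
    "Heart rate on admission: {heart_rate} bpm."
)

def render(patient: Patient) -> str:
    """Fill the template with one patient's attributes."""
    return TEMPLATE.format(**asdict(patient))

def single_variable_counterfactuals(
    base: Patient, alternatives: Dict[str, Sequence]
) -> List[Tuple[str, object, str]]:
    """Return (attribute, new_value, note) triples, changing one attribute at a time."""
    results = []
    for attr, values in alternatives.items():
        for value in values:
            if value == getattr(base, attr):
                continue  # skip values identical to the original note
            results.append((attr, value, render(replace(base, **{attr: value}))))
    return results

base = Patient(age=64, gender="male", ethnicity="white", heart_rate=88)
alternatives = {"age": [30, 64, 85], "gender": ["female"], "heart_rate": [130]}
counterfactuals = single_variable_counterfactuals(base, alternatives)
for attr, value, note in counterfactuals:
    print(f"{attr} -> {value}: {note}")
```

Each counterfactual note could then be scored by a model and compared against the score of the original note, which is the kind of comparison the abstract calls input-level sensitivity; the scoring step is omitted here because it depends on the model being tested.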

Related papers

PiCME: Pipeline for Contrastive Modality Evaluation and Encoding in the MIMIC Dataset [16.263862005367667]
Multimodal deep learning holds promise for improving clinical prediction by integrating diverse patient data. Contrastive learning facilitates this integration by producing a unified representation that can be reused across tasks. PiCME is the first to scale contrastive learning across all modality combinations in MIMIC.
arXiv Detail & Related papers (2025-07-03T20:45:37Z)
LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy [45.2233252981348]
Large Language Models (LLMs) have been shown to encode clinical knowledge. We present SemioLLM, an evaluation framework that benchmarks 6 state-of-the-art models. We show that most LLMs are able to accurately and confidently generate probabilistic predictions of seizure onset zones in the brain.
arXiv Detail & Related papers (2024-07-03T11:02:12Z)
WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions [46.60244609728416]
Language Models (LMs) are being proposed for mental health applications where the heightened risk of adverse outcomes means predictive performance may not be a litmus test of a model's utility in clinical practice. We introduce an evaluation design that focuses on the robustness and explainability of LMs in identifying Wellness Dimensions (WDs). We reveal four surprising results about LMs/LLMs.
arXiv Detail & Related papers (2024-06-17T19:50:40Z)
Bias patterns in the application of LLMs for clinical decision support: A comprehensive study [2.089191490381739]
Large Language Models (LLMs) have emerged as powerful candidates to inform clinical decision-making processes. These models play an increasingly prominent role in shaping the digital landscape. Two growing concerns emerge in healthcare applications: 1) to what extent do LLMs exhibit social bias based on patients' protected attributes (like race), and 2) how do design choices (like architecture design and prompting strategies) influence the observed biases?
arXiv Detail & Related papers (2024-04-23T15:52:52Z)
VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs. Existing benchmarks are often limited in scope, focusing mainly on object hallucinations. We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts. Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z)
TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment. In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials. We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z)
What Do You See in this Patient? Behavioral Testing of Clinical NLP Models [69.09570726777817]
We introduce an extendable testing framework that evaluates the behavior of clinical outcome models regarding changes of the input. We show that model behavior varies drastically even when fine-tuned on the same data and that allegedly best-performing models have not always learned the most medically plausible patterns.
arXiv Detail & Related papers (2021-11-30T15:52:04Z)
EventScore: An Automated Real-time Early Warning Score for Clinical Events [3.3039612529376625]
We build an interpretable model for the early prediction of various adverse clinical events indicative of clinical deterioration. The model is evaluated on two datasets and four clinical events. Our model can be entirely automated without requiring any manually recorded features.
arXiv Detail & Related papers (2021-02-11T11:55:08Z)
Clinical Outcome Prediction from Admission Notes using Self-Supervised Knowledge Integration [55.88616573143478]
Outcome prediction from clinical text can prevent doctors from overlooking possible risks. Diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction are four common outcome prediction targets. We propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources.
arXiv Detail & Related papers (2021-02-08T10:26:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.