Related papers: Beyond Calibration: Confounding Pathology Limits Foundation Model Specificity in Abdominal Trauma CT

Beyond Calibration: Confounding Pathology Limits Foundation Model Specificity in Abdominal Trauma CT

URL: http://arxiv.org/abs/2602.10359v1
Date: Tue, 10 Feb 2026 23:08:06 GMT
Title: Beyond Calibration: Confounding Pathology Limits Foundation Model Specificity in Abdominal Trauma CT
Authors: Jineel H Raythatha, Shuchang Ye, Jeremy Hsu, Jinman Kim,
Abstract summary: Translating foundation models into clinical practice requires evaluating their performance under compound distribution shift.<n>We investigated whether specificity deficits in foundation models are associated with heterogeneity in the negative class.
Score: 8.050646314390763
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Purpose: Translating foundation models into clinical practice requires evaluating their performance under compound distribution shift, where severe class imbalance coexists with heterogeneous imaging appearances. This challenge is relevant for traumatic bowel injury, a rare but high-mortality diagnosis. We investigated whether specificity deficits in foundation models are associated with heterogeneity in the negative class. Methods: This retrospective study used the multi-institutional, RSNA Abdominal Traumatic Injury CT dataset (2019-2023), comprising scans from 23 centres. Two foundation models (MedCLIP, zero-shot; RadDINO, linear probe) were compared against three task-specific approaches (CNN, Transformer, Ensemble). Models were trained on 3,147 patients (2.3% bowel injury prevalence) and evaluated on an enriched 100-patient test set. To isolate negative-class effects, specificity was assessed in patients without bowel injury who had concurrent solid organ injury (n=58) versus no abdominal pathology (n=50). Results: Foundation models achieved equivalent discrimination to task-specific models (AUC, 0.64-0.68 versus 0.58-0.64) with higher sensitivity (79-91% vs 41-74%) but lower specificity (33-50% vs 50-88%). All models demonstrated high specificity in patients without abdominal pathology (84-100%). When solid organ injuries were present, specificity declined substantially for foundation models (50-51 percentage points) compared with smaller reductions of 12-41 percentage points for task-specific models. Conclusion: Foundation models matched task-specific discrimination without task-specific training, but their specificity deficits were driven primarily by confounding negative-class heterogeneity rather than prevalence alone. Susceptibility to negative-class heterogeneity decreased progressively with labelled training, suggesting adaptation is required before clinical implementation.

Related papers

PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology [48.732366302949515]
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety.<n>We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions.<n>We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
arXiv Detail & Related papers (2026-03-02T00:50:39Z)
Deep Unsupervised Anomaly Detection in Brain Imaging: Large-Scale Benchmarking and Bias Analysis [42.60508892284938]
We present a large-scale, multi-center benchmark of deep unsupervised anomaly detection for brain imaging.<n>We tested 2,221 T1w and 1,262 T2w scans spanning healthy datasets and diverse clinical cohorts.<n>Our benchmark establishes a transparent foundation for future research and highlights priorities for clinical translation.
arXiv Detail & Related papers (2025-12-01T11:03:27Z)
MeCaMIL: Causality-Aware Multiple Instance Learning for Fair and Interpretable Whole Slide Image Diagnosis [40.3028468133626]
Multiple instance learning (MIL) has emerged as the dominant paradigm for whole slide image (WSI) analysis in computational pathology.<n>textbfMeCaMIL, a causality-aware MIL framework, explicitly models demographic confounders through structured causal graphs.<n>MeCaMIL achieves superior fairness -- demographic disparity variance drops by over 65% relative reduction on average across attributes.
arXiv Detail & Related papers (2025-11-14T06:47:21Z)
Taylor-Series Expanded Kolmogorov-Arnold Network for Medical Imaging Classification [0.0]
This study introduces Kolmogorov-Arnold Networks (KANs) for accurate medical image classification with limited, diverse datasets.<n>The models include SBTAYLOR-KAN, integrating B-splines with Taylor series; SBRBF-KAN, combining B-splines with Radial Basis Functions; and SBWAVELET-KAN, embedding B-splines in Morlet wavelet transforms.<n>The models were evaluated on brain MRI, chest X-rays, tuberculosis X-rays, and skin lesion images without preprocessing.
arXiv Detail & Related papers (2025-09-17T04:33:54Z)
Deep Learning for Glioblastoma Morpho-pathological Features Identification: A BraTS-Pathology Challenge Solution [5.347187213114967]
We present our approach to the BraTS-Path Challenge 2024.<n>We leverage a pre-trained model and fine-tune it on the BraTS-Path training dataset.<n>Our model exhibits perfect specificity of 0.898704, showing an exceptional capacity to correctly classify negative cases.
arXiv Detail & Related papers (2025-07-24T06:47:23Z)
Correcting Class Imbalances with Self-Training for Improved Universal Lesion Detection and Tagging [43.06199185109424]
Universal lesion detection and tagging (ULDT) in CT studies is critical for tumor burden assessment and tracking the progression of lesion status (growth/shrinkage) over time.<n>Prior work used the DeepLesion dataset (4,427 patients, 10,594 studies, 32,120 CT slices, 32,735 lesions, 8 body part labels) for algorithmic development, but this dataset is not completely annotated and contains class imbalances.<n>We developed a self-training pipeline for ULDT using a limited 11.5% subset of DeepLesion.
arXiv Detail & Related papers (2025-04-07T15:57:03Z)
Medical Hallucinations in Foundation Models and Their Impact on Healthcare [71.15392179084428]
Hallucinations in foundation models arise from autoregressive training objectives.<n>Top-performing models exceeded 97% accuracy when augmented with chain-of-thought prompting.
arXiv Detail & Related papers (2025-02-26T02:30:44Z)
CRTRE: Causal Rule Generation with Target Trial Emulation Framework [47.2836994469923]
We introduce a novel method called causal rule generation with target trial emulation framework (CRTRE) CRTRE applies randomize trial design principles to estimate the causal effect of association rules. We then incorporate such association rules for the downstream applications such as prediction of disease onsets.
arXiv Detail & Related papers (2024-11-10T02:40:06Z)
Artificial Intelligence-Based Triaging of Cutaneous Melanocytic Lesions [0.8864540224289991]
Pathologists are facing an increasing workload due to a growing volume of cases and the need for more comprehensive diagnoses. We developed an artificial intelligence (AI) model for triaging cutaneous melanocytic lesions based on whole slide images.
arXiv Detail & Related papers (2024-10-14T13:49:04Z)
A Comprehensive Evaluation of Histopathology Foundation Models for Ovarian Cancer Subtype Classification [1.9499122087408571]
Histopathology foundation models show great promise across many tasks. We report the most rigorous single-task validation of histopathology foundation models to date. Histopathology foundation models offer a clear benefit to ovarian cancer subtyping.
arXiv Detail & Related papers (2024-05-16T11:21:02Z)
Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals. Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z)
Multi-institutional Validation of Two-Streamed Deep Learning Method for Automated Delineation of Esophageal Gross Tumor Volume using planning-CT and FDG-PETCT [14.312659667401302]
Current clinical workflow for esophageal gross tumor volume (GTV) contouring relies on manual delineation of high labor-costs and interuser variability. To validate the clinical applicability of a deep learning (DL) multi-modality esophageal GTV contouring model, developed at 1 institution whereas tested at multiple ones.
arXiv Detail & Related papers (2021-10-11T13:56:09Z)
CT-based COVID-19 Triage: Deep Multitask Learning Improves Joint Identification and Severity Quantification [45.86448200141968]
We describe two basic setups: Identification of COVID-19 to prioritize studies of potentially infected patients to isolate them as early as possible; Severity quantification to highlight studies of severe patients and direct them to a hospital or provide emergency medical care. We propose a multitask approach to consolidate both triage approaches and propose a convolutional neural network to combine all available labels within a single model. We train our model on approximately 2000 publicly available CT studies and test it with a carefully designed set consisting of 32 COVID-19 studies, 30 cases with bacterial pneumonia, 31 healthy patients, and 30 patients with other lung pathologies to emulate a typical patient flow in
arXiv Detail & Related papers (2020-06-02T08:05:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.