Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models
- URL: http://arxiv.org/abs/2601.04163v1
- Date: Wed, 07 Jan 2026 18:24:12 GMT
- Title: Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models
- Authors: Erik Thiringer, Fredrik K. Gustafsson, Kajsa Ledesma Eriksson, Mattias Rantalainen,
- Abstract summary: Pathology foundation models (PFMs) have become central to computational pathology. Despite strong benchmark performance, PFM robustness to real-world technical domain shifts remains poorly understood. We evaluate the robustness of 14 PFMs to scanner-induced variability.
- Score: 3.8310079617300876
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Pathology foundation models (PFMs) have become central to computational pathology, aiming to offer general encoders for feature extraction from whole-slide images (WSIs). Despite strong benchmark performance, PFM robustness to real-world technical domain shifts, such as variability from whole-slide scanner devices, remains poorly understood. We systematically evaluated the robustness of 14 PFMs to scanner-induced variability, including state-of-the-art models, earlier self-supervised models, and a baseline trained on natural images. Using a multiscanner dataset of 384 breast cancer WSIs scanned on five devices, we isolated scanner effects independently from biological and laboratory confounders. Robustness is assessed via complementary unsupervised embedding analyses and a set of clinicopathological supervised prediction tasks. Our results demonstrate that current PFMs are not invariant to scanner-induced domain shifts. Most models encode pronounced scanner-specific variability in their embedding spaces. While AUC often remains stable, this masks a critical failure mode: scanner variability systematically alters the embedding space and impacts calibration of downstream model predictions, resulting in scanner-dependent bias that can impact reliability in clinical use cases. We further show that robustness is not a simple function of training data scale, model size, or model recency. None of the models provided reliable robustness against scanner-induced variability. While the models trained on the most diverse data, here represented by vision-language models, appear to have an advantage with respect to robustness, they underperformed on downstream supervised tasks. We conclude that development and evaluation of PFMs requires moving beyond accuracy-centric benchmarks toward explicit evaluation and optimisation of embedding stability and calibration under realistic acquisition variability.
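The calibration failure mode described in the abstract can be made concrete: an additive, scanner-dependent shift in logit space preserves the within-scanner ranking (and hence AUC) while distorting the predicted probabilities. A minimal NumPy sketch with synthetic logits and a hypothetical two-scanner shift (all names and values here are illustrative, not taken from the paper):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for the positive class: bin-size-weighted |accuracy - confidence|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return ece

rng = np.random.default_rng(0)
n = 2000
labels = rng.integers(0, 2, n)
logits = 2.0 * (labels - 0.5) + rng.normal(0.0, 1.0, n)  # label-driven ranking signal

# Hypothetical scanner-dependent additive shift in logit space: it preserves
# the within-scanner ranking (so AUC is unchanged) but distorts probabilities.
eces = {}
for scanner, shift in {"scanner_A": 0.0, "scanner_B": 1.5}.items():
    probs = 1.0 / (1.0 + np.exp(-(logits + shift)))
    eces[scanner] = expected_calibration_error(probs, labels)
```

Under this toy shift, `scanner_B` shows a markedly larger calibration error than `scanner_A` even though discrimination is identical, mirroring the "stable AUC, biased probabilities" pattern the paper reports.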
Related papers
- Pathology Foundation Models are Scanner Sensitive: Benchmark and Mitigation with Contrastive ScanGen Loss [6.310092608526967]
We show that Foundation Models (FMs) still suffer from scanner bias. We propose ScanGen, a contrastive loss function applied during task-specific fine-tuning that mitigates scanner bias. Our approach is applied to the Multiple Instance Learning task of Epidermal Growth Factor Receptor (EGFR) mutation prediction from H&E-stained WSIs in lung cancer.
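The paper defines its own ScanGen loss; as a rough illustration of the general idea only, here is a cross-scanner InfoNCE-style penalty that pulls embeddings of the same patch from different scanners together (the pairing scheme and temperature are assumptions, not ScanGen's actual formulation):

```python
import numpy as np

def cross_scanner_contrastive_penalty(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style penalty: row i of emb_a and emb_b are assumed to be the
    same tissue patch digitised on two scanners; matching rows should be
    nearest neighbours in embedding space."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = (a @ b.T) / temperature
    sims -= sims.max(axis=1, keepdims=True)             # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))          # cross-entropy on matching pairs

rng = np.random.default_rng(1)
patches = rng.normal(size=(8, 32))
# Nearly scanner-invariant embeddings incur a small penalty...
loss_aligned = cross_scanner_contrastive_penalty(patches, patches + 0.01 * rng.normal(size=(8, 32)))
# ...while scanner-dominated (unrelated) embeddings incur a large one.
loss_shifted = cross_scanner_contrastive_penalty(patches, rng.normal(size=(8, 32)))
```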
arXiv Detail & Related papers (2025-07-29T12:35:08Z) - SCORPION: Addressing Scanner-Induced Variability in Histopathology [7.091734389835427]
Ensuring reliable model performance across diverse domains is a critical challenge in computational pathology. We release SCORPION, a new dataset explicitly designed to evaluate model reliability under scanner variability. We propose SimCons, a flexible framework that combines augmentation-based domain generalization techniques with a consistency loss to explicitly address scanner generalization.
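The augmentation-plus-consistency recipe can be sketched in a few lines. The jitter function below is a crude stand-in for scanner-style appearance augmentation, and the MSE penalty is illustrative; SimCons's exact augmentations and loss may differ:

```python
import numpy as np

def scanner_jitter(img, gain=1.2, offset=0.05, rng=None):
    """Crude stand-in for scanner-style appearance augmentation (global
    gain/offset jitter); real pipelines use stain and colour transforms."""
    if rng is None:
        rng = np.random.default_rng()
    g = rng.uniform(1.0 / gain, gain)
    o = rng.uniform(-offset, offset)
    return np.clip(g * img + o, 0.0, 1.0)

def consistency_loss(pred_orig, pred_aug):
    """Penalise prediction drift between an image and its augmented view."""
    return float(np.mean((pred_orig - pred_aug) ** 2))

rng = np.random.default_rng(5)
img = rng.uniform(0.2, 0.7, size=(16, 16))
aug = scanner_jitter(img, rng=rng)

# An intensity-invariant feature map (per-image standardisation, a stand-in
# for a robust encoder) incurs near-zero penalty; raw intensities do not.
invariant = lambda x: (x - x.mean()) / (x.std() + 1e-8)
loss_invariant = consistency_loss(invariant(img), invariant(aug))
loss_raw = consistency_loss(img, aug)
```

Minimising such a penalty during training pushes the encoder toward outputs that do not move under acquisition-style perturbations.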
arXiv Detail & Related papers (2025-07-28T15:00:49Z) - CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts [78.79936076607373]
We introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify robustness of image classifiers for continuous and realistic nuisance shifts. We propose a filtering mechanism that outperforms previous methods, thereby enabling reliable benchmarking with generative models.
arXiv Detail & Related papers (2025-07-23T16:15:48Z) - A Vector-Quantized Foundation Model for Patient Behavior Monitoring [41.48188433408574]
This paper introduces a novel foundation model based on a modified vector quantized variational autoencoder, specifically designed to process real-world data from smartphones and wearable devices. We leveraged the discrete latent representation of this model to effectively perform two downstream tasks, suicide risk assessment and emotional state prediction, on different held-out clinical cohorts without the need for fine-tuning.
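The discretization step at the heart of a vector-quantized autoencoder is a nearest-codebook lookup. A minimal sketch (the codebook and latent values are invented for illustration; the paper's model adds an encoder, decoder, and training losses on top):

```python
import numpy as np

def quantize(z, codebook):
    """Map each latent vector to its nearest codebook entry by squared
    Euclidean distance, returning discrete indices and quantized vectors."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

# Toy codebook of K=3 entries in a D=2 latent space.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2]])
idx, z_quantized = quantize(z, codebook)
```

Downstream tasks can then operate on the discrete indices (or the snapped vectors) rather than the continuous latents, which is what enables reuse across cohorts without fine-tuning.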
arXiv Detail & Related papers (2025-03-19T14:01:16Z) - Implementing Trust in Non-Small Cell Lung Cancer Diagnosis with a Conformalized Uncertainty-Aware AI Framework in Whole-Slide Images [37.3701890138561]
TRUECAM is a framework designed to ensure both data and model trustworthiness in non-small cell lung cancer subtyping with whole-slide images. An AI model wrapped with TRUECAM significantly outperforms models that lack such guidance, in terms of classification accuracy, robustness, interpretability, and data efficiency.
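The "conformalized" part of the title refers to conformal prediction, which converts softmax scores into prediction sets with a coverage guarantee. A minimal split-conformal sketch (synthetic calibration data; TRUECAM's full machinery is richer than this):

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal: nonconformity score is 1 - p(true class); return the
    finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = np.ceil((n + 1) * (1.0 - alpha)) / n
    return np.quantile(scores, level, method="higher")

def prediction_set(probs, q):
    """All classes whose predicted probability clears the conformal threshold."""
    return np.nonzero(probs >= 1.0 - q)[0]

rng = np.random.default_rng(2)
n_cal, n_classes = 500, 3
cal_labels = rng.integers(0, n_classes, n_cal)
p_true = rng.uniform(0.3, 0.9, n_cal)   # an imperfectly confident classifier
cal_probs = ((1.0 - p_true) / (n_classes - 1))[:, None] * np.ones((n_cal, n_classes))
cal_probs[np.arange(n_cal), cal_labels] = p_true

q = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
confident = prediction_set(np.array([0.90, 0.06, 0.04]), q)   # singleton set
ambiguous = prediction_set(np.array([0.45, 0.40, 0.15]), q)   # multi-class set
```

Ambiguous cases yield larger sets, which is the signal a framework like this can use to flag inputs for human review.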
arXiv Detail & Related papers (2024-12-28T02:22:47Z) - Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z) - Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study [61.65123150513683]
Multimodal foundation models, such as CLIP, produce state-of-the-art zero-shot results.
It is reported that these models close the robustness gap by matching the performance of supervised models trained on ImageNet.
We show that CLIP leads to a significant robustness drop compared to supervised ImageNet models on our benchmark.
arXiv Detail & Related papers (2024-03-15T17:33:49Z) - Improving Robustness and Reliability in Medical Image Classification with Latent-Guided Diffusion and Nested-Ensembles [4.249986624493547]
Once deployed, medical image analysis methods are often faced with unexpected image corruptions and noise perturbations. LaDiNE is a novel ensemble learning method combining the robustness of Vision Transformers with diffusion-based generative models. Experiments on tuberculosis chest X-rays and melanoma skin cancer datasets demonstrate that LaDiNE achieves superior performance compared to a wide range of state-of-the-art methods.
arXiv Detail & Related papers (2023-10-24T15:53:07Z) - On Sensitivity and Robustness of Normalization Schemes to Input Distribution Shifts in Automatic MR Image Diagnosis [58.634791552376235]
Deep Learning (DL) models have achieved state-of-the-art performance in diagnosing multiple diseases using reconstructed images as input.
DL models are sensitive to varying artifacts, as these lead to changes in the input data distribution between the training and testing phases.
We propose to use other normalization techniques, such as Group Normalization and Layer Normalization, to inject robustness into model performance against varying image artifacts.
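The robustness argument for Group/Layer Normalization can be demonstrated directly: because statistics are computed per sample, a global gain-and-offset artifact applied to an image is removed by the normalization. A NumPy sketch of a GroupNorm forward pass (no learned affine; the artifact model is a simplification of real MR artifacts):

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """GroupNorm forward pass: normalize over (C // G channels, H, W)
    independently for each sample, without a learned affine transform."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

rng = np.random.default_rng(3)
x = rng.normal(size=(2, 8, 4, 4))
# Simulate a per-image intensity artifact: a global gain and offset.
x_artifact = 1.7 * x + 0.4

out_clean = group_norm(x, num_groups=4)
out_artifact = group_norm(x_artifact, num_groups=4)
```

The two outputs coincide (up to `eps`), whereas batch statistics computed at training time would not cancel a test-time artifact in the same way.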
arXiv Detail & Related papers (2023-06-23T03:09:03Z) - Towards Reliable Medical Image Segmentation by Modeling Evidential Calibrated Uncertainty [57.023423137202485]
Concerns regarding the reliability of medical image segmentation persist among clinicians. We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks. By leveraging subjective logic theory, we explicitly model probability and uncertainty for medical image segmentation.
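In subjective logic, per-class evidence is mapped to beliefs and an explicit uncertainty mass: with K classes, S = sum(e) + K, belief b_k = e_k / S, and uncertainty u = K / S. This is the general evidential recipe, not DEviS's specific network or calibration terms:

```python
import numpy as np

def subjective_logic_opinion(evidence):
    """Map non-negative per-class evidence e_k to beliefs b_k = e_k / S and
    uncertainty u = K / S, where S = sum(e) + K. Beliefs and u sum to 1."""
    k = len(evidence)
    s = evidence.sum() + k
    return evidence / s, k / s

# Strong evidence for class 0 -> low uncertainty; weak evidence -> high.
belief_strong, u_strong = subjective_logic_opinion(np.array([20.0, 1.0, 1.0]))
belief_weak, u_weak = subjective_logic_opinion(np.array([0.2, 0.1, 0.1]))
```

A segmentation network trained to output evidence thus gets a per-voxel uncertainty for free, which is what makes the predictions auditable by clinicians.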
arXiv Detail & Related papers (2023-01-01T05:02:46Z) - Many-to-One Distribution Learning and K-Nearest Neighbor Smoothing for Thoracic Disease Identification [83.6017225363714]
Deep learning has become the most powerful computer-aided diagnosis technology for improving disease identification performance.
For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming.
In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods to improve a single model's disease identification performance.
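K-nearest-neighbor smoothing of predictions can be sketched as averaging each sample's predicted probabilities over its nearest neighbors in feature space. This is the general idea only; the paper's KNNS may select and weight neighbors differently, and the features below are synthetic:

```python
import numpy as np

def knn_smooth(features, probs, k=3):
    """Replace each sample's predicted probabilities with the mean over its
    k nearest neighbours (self included) in feature space."""
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    nn = np.argsort(dists, axis=1)[:, :k]   # self is always its own nearest
    return probs[nn].mean(axis=1)

rng = np.random.default_rng(4)
# Two tight clusters of synthetic feature vectors with noisy predictions.
features = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(5.0, 0.1, (5, 2))])
probs = np.array([0.95, 0.80, 0.90, 0.85, 1.00, 0.05, 0.20, 0.10, 0.15, 0.00])[:, None]
smoothed = knn_smooth(features, probs, k=3)
```

Because neighbors fall within the same cluster, smoothing damps per-sample noise while leaving the cluster-level decision intact.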
arXiv Detail & Related papers (2021-02-26T02:29:30Z) - Improved inter-scanner MS lesion segmentation by adversarial training on longitudinal data [0.0]
The evaluation of white matter lesion progression is an important biomarker in the follow-up of MS patients.
Current automated lesion segmentation algorithms are susceptible to variability in image characteristics related to MRI scanner or protocol differences.
We propose a model that improves the consistency of MS lesion segmentations in inter-scanner studies.
arXiv Detail & Related papers (2020-02-03T16:56:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.