Revisiting Automatic Data Curation for Vision Foundation Models in Digital Pathology
- URL: http://arxiv.org/abs/2503.18709v1
- Date: Mon, 24 Mar 2025 14:23:48 GMT
- Title: Revisiting Automatic Data Curation for Vision Foundation Models in Digital Pathology
- Authors: Boqi Chen, Cédric Vincent-Cuaz, Lydia A. Schoenpflug, Manuel Madeira, Lisa Fournier, Vaishnavi Subramanian, Sonali Andani, Samuel Ruiperez-Campillo, Julia E. Vogt, Raphaëlle Luisier, Dorina Thanou, Viktor H. Koelzer, Pascal Frossard, Gabriele Campanella, Gunnar Rätsch,
- Abstract summary: Vision foundation models (FMs) learn to represent histological features in highly heterogeneous tiles extracted from whole-slide images.<n>We investigate the potential of unsupervised automatic data curation at the tile-level, taking into account 350 million tiles.
- Score: 41.34847597178388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision foundation models (FMs) are accelerating the development of digital pathology algorithms and transforming biomedical research. These models learn, in a self-supervised manner, to represent histological features in highly heterogeneous tiles extracted from whole-slide images (WSIs) of real-world patient samples. The performance of these FMs is significantly influenced by the size, diversity, and balance of the pre-training data. However, data selection has been primarily guided by expert knowledge at the WSI level, focusing on factors such as disease classification and tissue types, while largely overlooking the granular details available at the tile level. In this paper, we investigate the potential of unsupervised automatic data curation at the tile-level, taking into account 350 million tiles. Specifically, we apply hierarchical clustering trees to pre-extracted tile embeddings, allowing us to sample balanced datasets uniformly across the embedding space of the pretrained FM. We further identify these datasets are subject to a trade-off between size and balance, potentially compromising the quality of representations learned by FMs, and propose tailored batch sampling strategies to mitigate this effect. We demonstrate the effectiveness of our method through improved performance on a diverse range of clinically relevant downstream tasks.
Related papers
- Task-Specific Knowledge Distillation from the Vision Foundation Model for Enhanced Medical Image Segmentation [13.018234326432964]
We propose a novel and generalizable task-specific knowledge distillation framework for medical image segmentation.<n>Our method fine-tunes the VFM on the target segmentation task to capture task-specific features before distilling the knowledge to smaller models.<n> Experimental results across five medical image datasets demonstrate that our method consistently outperforms task-agnostic knowledge distillation.
arXiv Detail & Related papers (2025-03-10T06:39:53Z) - Dataset Distillation for Histopathology Image Classification [46.04496989951066]
We introduce a novel dataset distillation algorithm tailored for histopathology image datasets (Histo-DD)
We conduct a comprehensive evaluation of the effectiveness of the proposed algorithm and the generated histopathology samples in both patch-level and slide-level classification tasks.
arXiv Detail & Related papers (2024-08-19T05:53:38Z) - The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease
detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation.
We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of challenging problem in healthcare.
Within this framework, we train predictive 15 models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z) - ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Semantic
Diffusion Models [69.9178140563928]
Colonoscopy analysis is essential for assisting clinical diagnosis and treatment.
The scarcity of annotated data limits the effectiveness and generalization of existing methods.
We propose an Adaptive Refinement Semantic Diffusion Model (ArSDM) to generate colonoscopy images that benefit the downstream tasks.
arXiv Detail & Related papers (2023-09-03T07:55:46Z) - Explainable Techniques for Analyzing Flow Cytometry Cell Transformers [0.0]
We evaluate the usage of a transformer architecture called ReluFormer that eases attention visualization.
We propose a gradient- and an attention-based visualization technique tailored for Flow CytoMetry (FCM) data.
arXiv Detail & Related papers (2023-07-27T02:03:52Z) - Topologically Regularized Multiple Instance Learning to Harness Data
Scarcity [15.06687736543614]
Multiple Instance Learning models have emerged as a powerful tool to classify patients' microscopy samples.
We introduce a topological regularization term to MIL to mitigate this challenge.
We show an average enhancement of 2.8% for MIL benchmarks, 15.3% for synthetic MIL datasets, and 5.5% for real-world biomedical datasets over the current state-of-the-art.
arXiv Detail & Related papers (2023-07-26T08:14:18Z) - DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models [42.58375679841317]
We propose a new task, disentanglement of Diffusion Probabilistic Models (DPMs)
The task is to automatically discover the inherent factors behind the observations and disentangle the gradient fields of DPM into sub-gradient fields.
We devise an unsupervised approach named DisDiff, achieving disentangled representation learning in the framework of DPMs.
arXiv Detail & Related papers (2023-01-31T15:58:32Z) - Differentiable Agent-based Epidemiology [71.81552021144589]
We introduce GradABM: a scalable, differentiable design for agent-based modeling that is amenable to gradient-based learning with automatic differentiation.
GradABM can quickly simulate million-size populations in few seconds on commodity hardware, integrate with deep neural networks and ingest heterogeneous data sources.
arXiv Detail & Related papers (2022-07-20T07:32:02Z) - XDEEP-MSI: Explainable Bias-Rejecting Microsatellite Instability Deep
Learning System In Colorectal Cancer [0.0]
We present a system for the prediction of microsatellite instability (MSI) from H&E images of colorectal cancer using deep learning (DL) techniques customized for tissue microarrays (TMAs)
The system incorporates an end-to-end image preprocessing module that produces tiles at multiple magnifications in the regions of interest as guided by a tissue module, and a multiple-bias rejecting module.
arXiv Detail & Related papers (2021-10-28T17:58:01Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z) - Modeling Shared Responses in Neuroimaging Studies through MultiView ICA [94.31804763196116]
Group studies involving large cohorts of subjects are important to draw general conclusions about brain functional organization.
We propose a novel MultiView Independent Component Analysis model for group studies, where data from each subject are modeled as a linear combination of shared independent sources plus noise.
We demonstrate the usefulness of our approach first on fMRI data, where our model demonstrates improved sensitivity in identifying common sources among subjects.
arXiv Detail & Related papers (2020-06-11T17:29:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.