Is Exchangeability better than I.I.D to handle Data Distribution Shifts while Pooling Data for Data-scarce Medical image segmentation?
- URL: http://arxiv.org/abs/2507.19575v1
- Date: Fri, 25 Jul 2025 17:55:06 GMT
- Title: Is Exchangeability better than I.I.D to handle Data Distribution Shifts while Pooling Data for Data-scarce Medical image segmentation?
- Authors: Ayush Roy, Samin Enam, Jun Xia, Vishnu Suresh Lokhande, Won Hwa Kim
- Abstract summary: Data scarcity is a major challenge in medical imaging, particularly for deep learning models. We propose a method for controlling foreground-background feature discrepancies across all layers of deep networks. Our method achieves state-of-the-art segmentation performance on histopathology and ultrasound images across five datasets.
- Score: 12.292774988144949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data scarcity is a major challenge in medical imaging, particularly for deep learning models. While data pooling (combining datasets from multiple sources) and data addition (adding more data from a new dataset) have been shown to enhance model performance, they are not without complications. Specifically, increasing the size of the training dataset through pooling or addition can induce distributional shifts, negatively affecting downstream model performance, a phenomenon known as the "Data Addition Dilemma". While the traditional i.i.d. assumption may not hold in multi-source contexts, assuming exchangeability across datasets provides a more practical framework for data pooling. In this work, we investigate medical image segmentation under these conditions, drawing insights from causal frameworks to propose a method for controlling foreground-background feature discrepancies across all layers of deep networks. This approach improves feature representations, which are crucial in data-addition scenarios. Our method achieves state-of-the-art segmentation performance on histopathology and ultrasound images across five datasets, including a novel ultrasound dataset that we have curated and contributed. Qualitative results demonstrate more refined and accurate segmentation maps compared to prominent baselines across three model architectures. The code will be available on Github.
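The abstract above does not spell out the regularizer, so the following is only a minimal sketch of one way to control foreground-background feature discrepancy at every encoder layer: pool foreground and background prototypes under the (resized) ground-truth mask and penalize their cosine similarity. The function name, the list of per-layer feature maps, and the choice of cosine similarity are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): one way to control foreground-background
# feature discrepancy at every layer. `features` is a hypothetical list of per-layer
# maps [B, C_l, H_l, W_l]; `mask` is a binary foreground mask [B, 1, H, W].
import torch
import torch.nn.functional as F

def fg_bg_discrepancy(features, mask, eps=1e-6):
    """Cosine similarity between mean foreground and background features, averaged
    over layers; adding this to the segmentation loss pushes the two prototypes apart."""
    terms = []
    for feat in features:
        m = F.interpolate(mask.float(), size=feat.shape[-2:], mode="nearest")
        fg = (feat * m).sum(dim=(2, 3)) / (m.sum(dim=(2, 3)) + eps)            # [B, C_l]
        bg = (feat * (1 - m)).sum(dim=(2, 3)) / ((1 - m).sum(dim=(2, 3)) + eps)
        terms.append(F.cosine_similarity(fg, bg, dim=1).mean())
    return torch.stack(terms).mean()

# Usage sketch: loss = seg_loss + lambda_reg * fg_bg_discrepancy(layer_feats, gt_mask)
```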
Related papers
- The Data Addition Dilemma [4.869513274920574]
In many machine learning for healthcare tasks, standard datasets are constructed by amassing data across many, often fundamentally dissimilar, sources.
But when does adding more data help, and when does it hinder progress on desired model outcomes in real-world settings?
We identify this situation as the "Data Addition Dilemma", demonstrating that adding training data in this multi-source scaling context can at times result in reduced overall accuracy, uncertain fairness outcomes, and reduced worst-subgroup performance.
arXiv Detail & Related papers (2024-08-08T01:42:31Z)
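As a rough illustration of the comparison the entry above describes, the toy sketch below trains once on a single source and once on pooled sources, then reports per-subgroup and worst-subgroup metrics side by side; `train_model` and `evaluate` are hypothetical placeholders for an existing pipeline, not functions from the paper.

```python
# Toy sketch (not from the paper): check whether pooling an extra source helps
# or hurts, both overall and for the worst-performing subgroup / test site.
# `train_model` and `evaluate` are hypothetical stand-ins for a real pipeline.

def pooling_effect(source_a, source_b, test_sets):
    base = train_model(source_a)                  # model trained on the original source only
    pooled = train_model(source_a + source_b)     # model trained on the pooled data
    report = {
        name: {"base": evaluate(base, split), "pooled": evaluate(pooled, split)}
        for name, split in test_sets.items()      # e.g. one held-out split per site/subgroup
    }
    worst = {k: min(r[k] for r in report.values()) for k in ("base", "pooled")}
    return report, worst                          # pooling "hurts" if worst["pooled"] < worst["base"]
```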
- DataDream: Few-shot Guided Dataset Generation [90.09164461462365]
We propose a framework for synthesizing classification datasets that more faithfully represent the real data distribution.
DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model.
We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets.
arXiv Detail & Related papers (2024-07-15T17:10:31Z)
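A high-level sketch of the pipeline as summarized above: adapt the generator to the few real images with LoRA, synthesize a training set, then fine-tune CLIP (again via LoRA) on the synthetic data. Every helper below (`load_pretrained_generator`, `finetune_lora`, `load_pretrained_clip`, `generator.sample`) is a hypothetical placeholder rather than the authors' code or a specific library API.

```python
# High-level sketch of a DataDream-style pipeline (helper functions are
# hypothetical placeholders, not the authors' code or a specific library API).

def datadream_pipeline(few_real_images, class_names, n_per_class):
    # 1) Adapt the text-to-image generator to the few real examples; only the
    #    LoRA adapter weights are trained, the base model stays frozen.
    generator = load_pretrained_generator()
    generator = finetune_lora(generator, few_real_images)

    # 2) Generate a synthetic classification dataset with the adapted generator.
    synthetic = [(generator.sample(prompt=f"a photo of a {c}"), c)
                 for c in class_names for _ in range(n_per_class)]

    # 3) Fine-tune CLIP on the synthetic data, again through LoRA adapters.
    classifier = finetune_lora(load_pretrained_clip(), synthetic)
    return classifier
```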
- SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation [69.42764583465508]
We explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks.
To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation.
arXiv Detail & Related papers (2024-03-25T10:30:22Z)
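The entry above only states that images and masks are generated jointly; a common way to set that up (assumed here, not necessarily SatSynth's exact design) is to train the generative model on image channels concatenated with a one-hot mask and split the channels again after sampling.

```python
# Sketch of joint image+mask samples for a generative model (an assumed setup,
# not necessarily SatSynth's design): concatenate RGB channels with a one-hot
# label map, and split / re-discretize the channels after sampling.
import torch
import torch.nn.functional as F

def to_joint(image, mask, num_classes):
    """image: [3, H, W] in [0, 1]; mask: [H, W] integer class labels."""
    onehot = F.one_hot(mask.long(), num_classes).permute(2, 0, 1).float()  # [K, H, W]
    return torch.cat([image, onehot], dim=0)                               # [3 + K, H, W]

def from_joint(sample, num_classes):
    image, mask_channels = sample[:-num_classes], sample[-num_classes:]
    return image, mask_channels.argmax(dim=0)                              # recovered label map
```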
- FissionFusion: Fast Geometric Generation and Hierarchical Souping for Medical Image Analysis [0.7751705157998379]
The scarcity of well-annotated medical datasets requires leveraging transfer learning from broader datasets like ImageNet or pre-trained models like CLIP.
Model souping averages the weights of multiple fine-tuned models, aiming to improve performance on In-Domain (ID) tasks and enhance robustness against Out-of-Distribution (OOD) datasets.
We propose a hierarchical merging approach that involves local and global aggregation of models at various levels.
arXiv Detail & Related papers (2024-03-20T06:48:48Z)
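To make the souping idea above concrete, here is a minimal sketch of uniform weight averaging, with the hierarchy expressed as averaging within groups of models first and then across the resulting group soups; the grouping criterion and any weighting are assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch of model souping (uniform weight averaging) and a two-level
# hierarchical variant: average within each group locally, then globally across
# groups. Illustrative only; the paper's exact merging scheme may differ.
import torch

def average_state_dicts(state_dicts):
    """Uniform average of matching parameters across fine-tuned checkpoints."""
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0]}

def hierarchical_soup(groups):
    """groups: list of lists of state_dicts, e.g. one group per hyperparameter setting."""
    local = [average_state_dicts(g) for g in groups]   # local aggregation
    return average_state_dicts(local)                  # global aggregation

# Usage sketch: model.load_state_dict(hierarchical_soup(checkpoint_groups))
```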
- Distribution-Aware Data Expansion with Diffusion Models [55.979857976023695]
We propose DistDiff, a training-free data expansion framework based on a distribution-aware diffusion model.
DistDiff consistently enhances accuracy across a diverse range of datasets compared to models trained solely on original data.
arXiv Detail & Related papers (2024-03-11T14:07:53Z)
- MoCo-Transfer: Investigating out-of-distribution contrastive learning for limited-data domains [52.612507614610244]
We analyze the benefit of transferring self-supervised contrastive representations from Momentum Contrast (MoCo) pretraining to settings with limited data.
We find that, depending on the quantity of labeled and unlabeled data, contrastive pretraining on larger out-of-distribution datasets can perform nearly as well as or better than MoCo pretraining in-domain.
arXiv Detail & Related papers (2023-11-15T21:56:47Z)
- DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models [61.906934570771256]
We present a generic dataset generation model that can produce diverse synthetic images and perception annotations.
Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation.
We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module.
arXiv Detail & Related papers (2023-08-11T14:38:11Z)
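The entry above says diffusion features are decoded into perception annotations by a decoder module; the sketch below shows one generic way such a head could look (project multi-scale feature maps, fuse, predict logits). The architecture is an assumption for illustration, not DatasetDM's actual decoder.

```python
# Generic sketch of a perception decoder over multi-scale diffusion features
# (an assumed design, not DatasetDM's actual module): project each feature map
# to a shared width, resize to a common resolution, fuse, and predict logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerceptionDecoder(nn.Module):
    def __init__(self, in_channels, num_classes, hidden=128):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, hidden, 1) for c in in_channels)
        self.head = nn.Conv2d(hidden * len(in_channels), num_classes, 1)

    def forward(self, feats, out_size):
        fused = [F.interpolate(p(f), size=out_size, mode="bilinear", align_corners=False)
                 for p, f in zip(self.proj, feats)]
        return self.head(torch.cat(fused, dim=1))      # [B, num_classes, H, W] logits
```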
- Dynamic Data Augmentation via MCTS for Prostate MRI Segmentation [19.780410411548935]
We present Dynamic Data Augmentation (DDAug), which is efficient and has negligible computational cost.
DDAug builds a hierarchical tree structure to represent various augmentations.
Our method outperforms the current state-of-the-art data augmentation strategies.
arXiv Detail & Related papers (2023-05-25T06:44:43Z)
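The DDAug entry above mentions searching a hierarchical tree of augmentations with MCTS; the sketch below shows only the UCB-style selection and value update over such a tree, with the tree contents, reward (e.g. validation Dice), and expansion/simulation steps left out as assumptions.

```python
# Toy sketch of MCTS-style selection over a tree of augmentation choices
# (illustrative only; DDAug's actual tree, reward, and update rules may differ).
import math

class Node:
    def __init__(self, augmentation, children=None):
        self.augmentation = augmentation     # e.g. the name of an augmentation op
        self.children = children or []
        self.visits = 0
        self.value = 0.0                     # running mean reward, e.g. validation Dice

def select_child(node, c=1.4):
    """UCB1: trade off mean reward (exploitation) against visit counts (exploration)."""
    def ucb(child):
        if child.visits == 0:
            return float("inf")
        return child.value + c * math.sqrt(math.log(node.visits + 1) / child.visits)
    return max(node.children, key=ucb)

def select_path(root):
    """Walk root-to-leaf, picking one augmentation per tree level."""
    nodes, node = [root], root
    while node.children:
        node = select_child(node)
        nodes.append(node)
    return nodes

def backpropagate(nodes, reward):
    """Update visit counts and running mean reward along the selected path."""
    for node in nodes:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
```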
- Improving CT Image Segmentation Accuracy Using StyleGAN Driven Data Augmentation [42.034896915716374]
This paper presents a StyleGAN-driven approach for segmenting publicly available large medical datasets.
Style transfer is used to augment the training dataset and generate new anatomically sound images.
The augmented dataset is then used to train a U-Net segmentation network, which shows a significant improvement in segmentation accuracy.
arXiv Detail & Related papers (2023-02-07T06:34:10Z)
- Learning from Partially Overlapping Labels: Image Segmentation under Annotation Shift [68.6874404805223]
We propose several strategies for learning from partially overlapping labels in the context of abdominal organ segmentation.
We find that combining a semi-supervised approach with an adaptive cross entropy loss can successfully exploit heterogeneously annotated data.
arXiv Detail & Related papers (2021-07-13T09:22:24Z)
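For the last entry, the adaptive cross entropy loss is not specified here; a common formulation for heterogeneously annotated data (assumed below, not necessarily the paper's exact loss) folds the predicted probability of organs a given source did not annotate into the background class before computing cross-entropy.

```python
# Sketch of a cross-entropy variant for partially annotated sources (an assumed
# formulation, not necessarily the paper's adaptive loss): probability mass of
# classes that this source did not annotate is folded into the background class.
# Assumes the background index is always among the annotated classes.
import torch
import torch.nn.functional as F

def partial_label_ce(logits, target, annotated, background=0, eps=1e-8):
    """logits: [B, C, H, W]; target: [B, H, W] labels valid for this source;
    annotated: boolean tensor of shape [C], True for classes labeled by this source."""
    probs = logits.softmax(dim=1)
    missing = ~annotated                                   # classes unlabeled in this source
    merged = probs.clone()
    merged[:, background] = probs[:, background] + probs[:, missing].sum(dim=1)
    merged[:, missing] = 0.0
    return F.nll_loss(torch.log(merged + eps), target)
```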
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.