Synthetic Augmentation with Large-scale Unconditional Pre-training
- URL: http://arxiv.org/abs/2308.04020v1
- Date: Tue, 8 Aug 2023 03:34:04 GMT
- Title: Synthetic Augmentation with Large-scale Unconditional Pre-training
- Authors: Jiarong Ye, Haomiao Ni, Peng Jin, Sharon X. Huang, Yuan Xue
- Abstract summary: We propose a synthetic augmentation method called HistoDiffusion to reduce the dependency on annotated data.
HistoDiffusion can be pre-trained on large-scale unlabeled datasets and later applied to a small-scale labeled dataset for augmented training.
We evaluate our proposed method by pre-training on three histopathology datasets and testing on a histopathology dataset of colorectal cancer (CRC) excluded from the pre-training datasets.
- Score: 4.162192894410251
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning based medical image recognition systems often require a
substantial amount of training data with expert annotations, which can be
expensive and time-consuming to obtain. Recently, synthetic augmentation
techniques have been proposed to mitigate the issue by generating realistic
images conditioned on class labels. However, the effectiveness of these methods
heavily depends on the representation capability of the trained generative
model, which cannot be guaranteed without sufficient labeled training data. To
further reduce the dependency on annotated data, we propose a synthetic
augmentation method called HistoDiffusion, which can be pre-trained on
large-scale unlabeled datasets and later applied to a small-scale labeled
dataset for augmented training. In particular, we train a latent diffusion
model (LDM) on diverse unlabeled datasets to learn common features and generate
realistic images without conditional inputs. Then, we fine-tune the model with
classifier guidance in latent space on an unseen labeled dataset so that the
model can synthesize images of specific categories. Additionally, we adopt a
selective mechanism to only add synthetic samples with high confidence of
matching to target labels. We evaluate our proposed method by pre-training on
three histopathology datasets and testing on a histopathology dataset of
colorectal cancer (CRC) excluded from the pre-training datasets. With
HistoDiffusion augmentation, the classification accuracy of a backbone
classifier is remarkably improved by 6.4% using a small set of the original
labels. Our code is available at https://github.com/karenyyy/HistoDiffAug.
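As a rough sketch of the pipeline the abstract describes — unconditional latent-diffusion pre-training, classifier guidance in latent space, and a confidence-based selective mechanism — the following PyTorch-style pseudocode may help. The `ldm` object, `latent_classifier`, `guidance_scale`, and threshold `tau` are illustrative assumptions, not the authors' actual API; the real implementation is in the linked repository.

```python
# Illustrative sketch only: `ldm`, `latent_classifier`, `eval_classifier`,
# and their methods are assumed stand-ins, not the authors' actual API.
import torch
import torch.nn.functional as F

def guided_sample(ldm, latent_classifier, labels, steps=50, guidance_scale=2.0):
    """Sample from an unconditionally pre-trained latent diffusion model,
    steering each step toward `labels` with a latent-space classifier."""
    z = torch.randn(labels.shape[0], ldm.latent_dim)      # start from noise
    for t in reversed(range(steps)):
        t_batch = torch.full((z.shape[0],), t, dtype=torch.long)
        eps = ldm.predict_noise(z, t_batch)               # unconditional epsilon
        with torch.enable_grad():                         # classifier guidance
            z_in = z.detach().requires_grad_(True)
            log_p = F.log_softmax(latent_classifier(z_in, t_batch), dim=-1)
            sel = log_p.gather(1, labels.unsqueeze(1)).sum()
            grad = torch.autograd.grad(sel, z_in)[0]      # d log p(y|z_t) / dz_t
        eps = eps - guidance_scale * ldm.sigma(t) * grad  # shift toward class y
        z = ldm.denoise_step(z, eps, t)                   # one reverse step
    return ldm.decode(z)                                  # latents -> images

@torch.no_grad()
def select_confident(images, labels, eval_classifier, tau=0.9):
    """Selective mechanism: keep only synthetic samples that an evaluation
    classifier assigns to the target label with probability >= tau."""
    probs = F.softmax(eval_classifier(images), dim=-1)
    conf = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    return images[conf >= tau], labels[conf >= tau]
```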
Related papers
- Continuous Contrastive Learning for Long-Tailed Semi-Supervised Recognition [50.61991746981703]
Current state-of-the-art LTSSL approaches rely on high-quality pseudo-labels for large-scale unlabeled data.
This paper introduces a novel probabilistic framework that unifies various recent proposals in long-tail learning.
We introduce a continuous contrastive learning method, CCL, extending our framework to unlabeled data using reliable and smoothed pseudo-labels.
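A minimal sketch of what "reliable and smoothed" pseudo-labels can look like in code; whether CCL uses exactly this EMA-plus-threshold recipe is an assumption, so see the paper for the actual scheme.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def smoothed_pseudo_labels(logits, ema_probs, momentum=0.9, threshold=0.95):
    """Smooth predictions over time with an EMA, then keep only samples
    whose smoothed confidence clears a threshold ("reliable")."""
    probs = F.softmax(logits, dim=-1)
    ema_probs = momentum * ema_probs + (1 - momentum) * probs
    conf, labels = ema_probs.max(dim=-1)
    mask = conf >= threshold
    return labels, ema_probs, mask
```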
arXiv Detail & Related papers (2024-10-08T15:06:10Z)
- Enhancing Image Classification in Small and Unbalanced Datasets through Synthetic Data Augmentation [0.0]
This paper introduces a novel synthetic augmentation strategy that uses class-specific Variational Autoencoders (VAEs) and their latent spaces to improve discrimination capabilities.
By generating realistic, varied synthetic data that fills feature space gaps, we address issues of data scarcity and class imbalance.
The proposed strategy was tested in a small dataset of 321 images created to train and validate an automatic method for assessing the quality of cleanliness of esophagogastroduodenoscopy images.
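A minimal sketch of class-specific VAE augmentation, assuming one VAE per class with hypothetical `encode`/`decode` methods; the paper's exact sampling procedure may differ.

```python
import torch

@torch.no_grad()
def augment_class(vae_c, class_images, n_new):
    """vae_c: a VAE trained only on images of one class (hypothetical
    encode/decode API). New samples are decoded from latents drawn
    around the class's posterior, filling gaps near the class manifold."""
    mu, logvar = vae_c.encode(class_images)
    idx = torch.randint(0, mu.shape[0], (n_new,))
    std = (0.5 * logvar[idx]).exp()
    z = mu[idx] + torch.randn_like(std) * std
    return vae_c.decode(z)
```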
arXiv Detail & Related papers (2024-09-16T13:47:52Z)
- Dataset Distillation for Histopathology Image Classification [46.04496989951066]
We introduce a novel dataset distillation algorithm tailored for histopathology image datasets (Histo-DD).
We conduct a comprehensive evaluation of the effectiveness of the proposed algorithm and the generated histopathology samples in both patch-level and slide-level classification tasks.
arXiv Detail & Related papers (2024-08-19T05:53:38Z)
- Heavy Labels Out! Dataset Distillation with Label Space Lightening [69.67681224137561]
HeLlO aims to learn effective image-to-label projectors, with which synthetic labels can be generated online directly from synthetic images.
We demonstrate that with only about 0.003% of the original storage required for a complete set of soft labels, we achieve comparable performance to current state-of-the-art dataset distillation methods on large-scale datasets.
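A sketch of the projector idea, under the assumption of a frozen feature encoder with a small linear head; the names are illustrative, not HeLlO's actual architecture.

```python
import torch
import torch.nn as nn

class LabelProjector(nn.Module):
    """Maps a synthetic image to its soft label on the fly, replacing a
    stored soft-label matrix. Frozen-encoder-plus-linear-head is an assumption."""
    def __init__(self, encoder, feat_dim, num_classes):
        super().__init__()
        self.encoder = encoder                    # frozen feature extractor
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, synthetic_images):
        with torch.no_grad():
            feats = self.encoder(synthetic_images)
        return self.head(feats).softmax(dim=-1)  # soft labels, generated online
```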
arXiv Detail & Related papers (2024-08-15T15:08:58Z)
- Exploring Beyond Logits: Hierarchical Dynamic Labeling Based on Embeddings for Semi-Supervised Classification [49.09505771145326]
We propose a Hierarchical Dynamic Labeling (HDL) algorithm that does not depend on model predictions and utilizes image embeddings to generate sample labels.
Our approach has the potential to change the paradigm of pseudo-label generation in semi-supervised learning.
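As a simplified stand-in for labeling from embeddings rather than logits, consider a k-NN vote over labeled neighbors; HDL's actual scheme is hierarchical and dynamic, so treat this only as the basic idea.

```python
import torch

@torch.no_grad()
def knn_labels(unlabeled_emb, labeled_emb, labeled_y, k=5):
    """Assign each unlabeled sample the majority label of its k nearest
    labeled neighbors in embedding space (no classifier logits involved)."""
    d = torch.cdist(unlabeled_emb, labeled_emb)   # pairwise distances
    nn_idx = d.topk(k, largest=False).indices     # (N, k) nearest neighbors
    votes = labeled_y[nn_idx]                     # neighbor labels
    return votes.mode(dim=1).values               # majority vote per sample
```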
arXiv Detail & Related papers (2024-04-26T06:00:27Z)
- How Can We Tame the Long-Tail of Chest X-ray Datasets? [0.0]
Chest X-rays (CXRs) are a medical imaging modality that is used to infer a large number of abnormalities.
A few of them are quite commonly observed and are abundantly represented in CXR datasets.
It is challenging for current models to learn independent discriminatory features for labels that are rare but may be of high significance.
arXiv Detail & Related papers (2023-09-08T12:28:40Z)
- DiffInfinite: Large Mask-Image Synthesis via Parallel Random Patch Diffusion in Histopathology [10.412322654017313]
We present DiffInfinite, a hierarchical diffusion model that generates arbitrarily large histological images.
The proposed sampling method can be scaled up to any desired image size while only requiring small patches for fast training.
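The scaling idea can be sketched as patch-wise denoising of a large canvas with a model trained on small patches; DiffInfinite's hierarchical, mask-conditioned procedure is more involved, and all names below are assumptions.

```python
import torch

def patch_denoise_step(model, canvas, t, patch=64, n_patches=32):
    """One reverse-diffusion step over a large canvas, computed by averaging
    the patch-level model's outputs on random overlapping crops. In practice
    patches are drawn until every pixel is covered."""
    acc = torch.zeros_like(canvas)
    cnt = torch.zeros_like(canvas)
    _, _, H, W = canvas.shape
    for _ in range(n_patches):
        y = torch.randint(0, H - patch + 1, (1,)).item()
        x = torch.randint(0, W - patch + 1, (1,)).item()
        crop = canvas[:, :, y:y+patch, x:x+patch]
        acc[:, :, y:y+patch, x:x+patch] += model(crop, t)  # patch-level output
        cnt[:, :, y:y+patch, x:x+patch] += 1
    return acc / cnt.clamp(min=1)  # average where patches overlap
```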
arXiv Detail & Related papers (2023-06-23T09:10:41Z)
- Self-Evolution Learning for Mixup: Enhance Data Augmentation on Few-Shot Text Classification Tasks [75.42002070547267]
We propose a self-evolution learning (SE) based mixup approach for data augmentation in text classification.
We introduce a novel instance-specific label smoothing approach, which linearly interpolates the model's output and the one-hot labels of the original samples to generate new soft labels for mixing up.
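The interpolation itself is simple to write down; a sketch with an assumed weight `alpha` (the paper's exact schedule may differ):

```python
import torch
import torch.nn.functional as F

def se_soft_labels(logits, targets, num_classes, alpha=0.9):
    """Instance-specific smoothing: blend each sample's one-hot label
    with the model's own (detached) prediction for that sample."""
    one_hot = F.one_hot(targets, num_classes).float()
    probs = F.softmax(logits.detach(), dim=-1)
    return alpha * one_hot + (1 - alpha) * probs

def mixup(x, soft_y, lam=0.7):
    """Mix inputs and their soft labels with a shared coefficient."""
    perm = torch.randperm(x.shape[0])
    return lam * x + (1 - lam) * x[perm], lam * soft_y + (1 - lam) * soft_y[perm]
```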
arXiv Detail & Related papers (2023-05-22T23:43:23Z)
- Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
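One way to make diversity an optimizable objective is to add an instance-discrimination term to the usual inversion loss; CMI's exact formulation differs, so the following is only illustrative.

```python
import torch
import torch.nn.functional as F

def cmi_style_loss(teacher, embed, x_syn, targets, temp=0.1, w=1.0):
    """Inversion loss plus an instance-discrimination term that rewards
    mutually distinguishable (i.e., diverse) synthetic samples."""
    ce = F.cross_entropy(teacher(x_syn), targets)      # match target classes
    z = F.normalize(embed(x_syn), dim=-1)
    sim = z @ z.t() / temp                             # pairwise similarities
    idx = torch.arange(z.shape[0], device=z.device)
    inst = F.cross_entropy(sim, idx)                   # each sample ~ only itself
    return ce + w * inst
```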
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
- Semi-supervised Medical Image Classification with Relation-driven Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
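The perturbation-consistency part of this last entry can be sketched in a few lines; the paper's relation-driven term, which also enforces consistency of sample relations, is omitted here.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlabeled, perturb):
    """Penalize disagreement between predictions on an unlabeled batch
    and on a perturbed copy of the same batch."""
    with torch.no_grad():
        p_clean = F.softmax(model(x_unlabeled), dim=-1)   # target, no gradient
    p_pert = F.softmax(model(perturb(x_unlabeled)), dim=-1)
    return F.mse_loss(p_pert, p_clean)
```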
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.