Synthetic Augmentation with Large-scale Unconditional Pre-training
- URL: http://arxiv.org/abs/2308.04020v1
- Date: Tue, 8 Aug 2023 03:34:04 GMT
- Title: Synthetic Augmentation with Large-scale Unconditional Pre-training
- Authors: Jiarong Ye, Haomiao Ni, Peng Jin, Sharon X. Huang, Yuan Xue
- Abstract summary: We propose a synthetic augmentation method called HistoDiffusion to reduce the dependency on annotated data.
HistoDiffusion can be pre-trained on large-scale unlabeled datasets and later applied to a small-scale labeled dataset for augmented training.
We evaluate our proposed method by pre-training on three histopathology datasets and testing on a histopathology dataset of colorectal cancer (CRC) excluded from the pre-training datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning based medical image recognition systems often require a
substantial amount of training data with expert annotations, which can be
expensive and time-consuming to obtain. Recently, synthetic augmentation
techniques have been proposed to mitigate the issue by generating realistic
images conditioned on class labels. However, the effectiveness of these methods
heavily depends on the representation capability of the trained generative
model, which cannot be guaranteed without sufficient labeled training data. To
further reduce the dependency on annotated data, we propose a synthetic
augmentation method called HistoDiffusion, which can be pre-trained on
large-scale unlabeled datasets and later applied to a small-scale labeled
dataset for augmented training. In particular, we train a latent diffusion
model (LDM) on diverse unlabeled datasets to learn common features and generate
realistic images without conditional inputs. Then, we fine-tune the model with
classifier guidance in latent space on an unseen labeled dataset so that the
model can synthesize images of specific categories. Additionally, we adopt a
selective mechanism to only add synthetic samples with high confidence of
matching to target labels. We evaluate our proposed method by pre-training on
three histopathology datasets and testing on a histopathology dataset of
colorectal cancer (CRC) excluded from the pre-training datasets. With
HistoDiffusion augmentation, the classification accuracy of a backbone
classifier is remarkably improved by 6.4% using a small set of the original
labels. Our code is available at https://github.com/karenyyy/HistoDiffAug.
Related papers
- Exploring Beyond Logits: Hierarchical Dynamic Labeling Based on Embeddings for Semi-Supervised Classification [49.09505771145326]
We propose a Hierarchical Dynamic Labeling (HDL) algorithm that does not depend on model predictions and utilizes image embeddings to generate sample labels.
Our approach has the potential to change the paradigm of pseudo-label generation in semi-supervised learning.
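A minimal sketch of embedding-based pseudo-labeling: assign each unlabeled sample the label of its nearest labeled embedding. This is a simplified stand-in; HDL's hierarchical dynamic scheme is more involved, and all names and numbers here are illustrative.

```python
import numpy as np

def pseudo_label_by_embedding(unlabeled_emb, labeled_emb, labels):
    """Assign each unlabeled sample the label of its nearest labeled
    embedding (Euclidean distance), without using model predictions."""
    unlabeled_emb = np.asarray(unlabeled_emb, dtype=float)
    labeled_emb = np.asarray(labeled_emb, dtype=float)
    labels = np.asarray(labels)
    # Pairwise distances, shape (num_unlabeled, num_labeled)
    d = np.linalg.norm(unlabeled_emb[:, None, :] - labeled_emb[None, :, :], axis=-1)
    return labels[d.argmin(axis=1)]

pseudo = pseudo_label_by_embedding(
    [[0.1, 0.0], [0.9, 1.0]],   # unlabeled embeddings
    [[0.0, 0.0], [1.0, 1.0]],   # labeled embeddings
    [0, 1],                     # labels of the labeled embeddings
)
# pseudo -> array([0, 1])
```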
arXiv Detail & Related papers (2024-04-26T06:00:27Z) - DreamDA: Generative Data Augmentation with Diffusion Models [68.22440150419003]
This paper proposes a new classification-oriented framework DreamDA.
DreamDA generates diverse samples that adhere to the original data distribution by considering training images in the original data as seeds.
In addition, since the labels of the generated data may not align with the labels of their corresponding seed images, we introduce a self-training paradigm for generating pseudo labels.
arXiv Detail & Related papers (2024-03-19T15:04:35Z) - Group Distributionally Robust Dataset Distillation with Risk Minimization [18.07189444450016]
We introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct DD.
We demonstrate its effective generalization and robustness across subgroups through numerical experiments.
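The group-robust idea can be sketched by evaluating the worst mean loss over clusters instead of the overall mean; minimizing this quantity targets subgroup robustness. The risk measure shown (worst-group mean) is a common illustrative choice, not necessarily the one the paper uses.

```python
import numpy as np

def worst_group_risk(losses, cluster_ids):
    """Group-robust objective: the largest per-cluster mean loss.
    Minimizing this, rather than the overall mean, protects the worst subgroup."""
    losses = np.asarray(losses, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    group_means = [losses[cluster_ids == c].mean() for c in np.unique(cluster_ids)]
    return float(max(group_means))

# Two clusters with mean losses 0.2 and 0.8 -> robust risk is 0.8
risk = worst_group_risk([0.1, 0.3, 0.7, 0.9], [0, 0, 1, 1])
```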
arXiv Detail & Related papers (2024-02-07T09:03:04Z) - How Can We Tame the Long-Tail of Chest X-ray Datasets? [0.0]
Chest X-rays (CXRs) are a medical imaging modality that is used to infer a large number of abnormalities.
A few abnormalities are commonly observed and abundantly represented in CXR datasets.
It is challenging for current models to learn independent discriminatory features for labels that are rare but may be of high significance.
arXiv Detail & Related papers (2023-09-08T12:28:40Z) - DiffInfinite: Large Mask-Image Synthesis via Parallel Random Patch Diffusion in Histopathology [10.412322654017313]
We present DiffInfinite, a hierarchical diffusion model that generates arbitrarily large histological images.
The proposed sampling method can be scaled up to any desired image size while only requiring small patches for fast training.
arXiv Detail & Related papers (2023-06-23T09:10:41Z) - Self-Evolution Learning for Mixup: Enhance Data Augmentation on Few-Shot Text Classification Tasks [75.42002070547267]
We propose a self-evolution learning (SE) based mixup approach for data augmentation in text classification.
We introduce a novel instance-specific label smoothing approach, which linearly interpolates the model's output and the one-hot labels of the original samples to generate new soft labels for mixup.
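The interpolation described above can be sketched as follows: build an instance-specific soft label from the model's output and the one-hot label, then mix two samples with standard mixup. The smoothing coefficient and example values are illustrative assumptions.

```python
import numpy as np

def soft_label(model_probs, one_hot, smooth=0.2):
    """Instance-specific label smoothing: linearly interpolate the model's
    predicted distribution with the one-hot label (coefficient is illustrative)."""
    return (1.0 - smooth) * np.asarray(one_hot, float) + smooth * np.asarray(model_probs, float)

def mixup(x1, y1, x2, y2, lam=0.5):
    """Standard mixup of two (input, soft-label) pairs."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

y1 = soft_label([0.7, 0.3], [1, 0])   # -> [0.94, 0.06]
y2 = soft_label([0.2, 0.8], [0, 1])   # -> [0.04, 0.96]
x_mix, y_mix = mixup([1.0, 0.0], y1, [0.0, 1.0], y2)
# y_mix -> [0.49, 0.51]
```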
arXiv Detail & Related papers (2023-05-22T23:43:23Z) - Data Augmentation using Feature Generation for Volumetric Medical Images [0.08594140167290097]
Medical image classification is one of the most critical problems in the image recognition area.
One of the major challenges in this field is the scarcity of labelled training data.
Deep Learning models, in particular, show promising results on image segmentation and classification problems.
arXiv Detail & Related papers (2022-09-28T13:46:24Z) - Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To make this tractable, we apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z) - Semi-supervised Medical Image Classification with Relation-driven Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
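The consistency idea in the last entry can be sketched as a penalty on the disagreement between predictions for an input and a perturbed view of it. A mean-squared-error consistency term is shown here as one common choice; the paper's exact loss may differ.

```python
import numpy as np

def consistency_loss(p_clean, p_perturbed):
    """Mean squared difference between predictions on an input and its
    perturbed view -- the consistency term applied to unlabeled data."""
    p_clean = np.asarray(p_clean, dtype=float)
    p_perturbed = np.asarray(p_perturbed, dtype=float)
    return float(np.mean((p_clean - p_perturbed) ** 2))

loss = consistency_loss([0.9, 0.1], [0.8, 0.2])
# -> 0.01
```

Minimizing this term on unlabeled inputs encourages the stable predictions the summary describes.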
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.