Do We Need All the Synthetic Data? Towards Targeted Synthetic Image Augmentation via Diffusion Models
- URL: http://arxiv.org/abs/2505.21574v1
- Date: Tue, 27 May 2025 07:27:03 GMT
- Title: Do We Need All the Synthetic Data? Towards Targeted Synthetic Image Augmentation via Diffusion Models
- Authors: Dang Nguyen, Jiping Li, Jinghao Zheng, Baharan Mirzasoleiman
- Abstract summary: We show that synthetically augmenting part of the data that is not learned early in training outperforms augmenting the entire dataset. Our method boosts the performance by up to 2.8% in a variety of scenarios. It can also easily stack with existing weak and strong augmentation strategies to further boost the performance.
- Score: 12.472871440252105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetically augmenting training datasets with diffusion models has been an effective strategy for improving generalization of image classifiers. However, existing techniques struggle to ensure the diversity of generation and increase the size of the data by up to 10-30x to improve the in-distribution performance. In this work, we show that synthetically augmenting part of the data that is not learned early in training outperforms augmenting the entire dataset. By analyzing a two-layer CNN, we prove that this strategy improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Our extensive experiments show that by augmenting only 30%-40% of the data, our method boosts the performance by up to 2.8% in a variety of scenarios, including training ResNet, ViT and DenseNet on CIFAR-10, CIFAR-100, and TinyImageNet, with a range of optimizers including SGD and SAM. Notably, our method applied with SGD outperforms the SOTA optimizer, SAM, on CIFAR-100 and TinyImageNet. It can also easily stack with existing weak and strong augmentation strategies to further boost the performance.
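The abstract's core recipe can be illustrated with a minimal sketch. This is not the paper's implementation: the selection criterion (lowest average correctness over the first few epochs), the `frac` value, and the `select_targets` helper are all assumptions for illustration, and the diffusion model that would generate the synthetic images is omitted entirely.

```python
import numpy as np

def select_targets(early_correct, frac=0.35):
    """Pick the fraction of examples learned slowest early in training.

    early_correct: (n_epochs, n_examples) boolean array recording whether
    each example was classified correctly at each of the first few epochs.
    Returns the indices of the `frac` slowest-learned examples -- under the
    paper's strategy, the only ones that get synthetically augmented.
    """
    scores = early_correct.mean(axis=0)   # per-example early learning-speed proxy
    k = int(frac * scores.shape[0])
    return np.argsort(scores)[:k]         # lowest early accuracy first

# toy run: 4 early epochs, 10 examples of increasing "easiness"
rng = np.random.default_rng(0)
early = rng.random((4, 10)) < np.linspace(0.1, 0.9, 10)
idx = select_targets(early, frac=0.3)
print(len(idx))  # 3 of 10 examples (30%) selected for augmentation
```

The selected indices would then be passed to a diffusion model to generate additional variants of just those images, leaving the rest of the dataset untouched.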
Related papers
- Soft Augmentation for Image Classification [68.71067594724663]
We propose generalizing augmentation with invariant transforms to soft augmentation.
We show that soft targets allow for more aggressive data augmentation.
We also show that soft augmentations generalize to self-supervised classification tasks.
arXiv Detail & Related papers (2022-11-09T01:04:06Z)
- SAGE: Saliency-Guided Mixup with Optimal Rearrangements [22.112463794733188]
Saliency-Guided Mixup with Optimal Rearrangements (SAGE)
SAGE creates new training examples by rearranging and mixing image pairs using visual saliency as guidance.
We demonstrate on CIFAR-10 and CIFAR-100 that SAGE achieves better or comparable performance to the state of the art while being more efficient.
arXiv Detail & Related papers (2022-10-31T19:45:21Z)
- Data-Efficient Augmentation for Training Neural Networks [15.870155099135538]
We propose a rigorous technique to select subsets of data points that when augmented, closely capture the training dynamics of full data augmentation.
Our method achieves 6.3x speedup on CIFAR10 and 2.2x speedup on SVHN, and outperforms the baselines by up to 10% across various subset sizes.
arXiv Detail & Related papers (2022-10-15T19:32:20Z)
- You Only Cut Once: Boosting Data Augmentation with a Single Cut [85.90978190685837]
We present You Only Cut Once (YOCO) for performing data augmentations.
YOCO cuts one image into two pieces and performs data augmentations individually within each piece.
Applying YOCO improves the diversity of the augmentation per sample and encourages neural networks to recognize objects from partial information.
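The cut-once idea described above is simple enough to sketch directly. This is a hedged illustration rather than the paper's code: the cut axis and the stand-in flip augmentation are assumptions, and any per-piece transform could be substituted.

```python
import numpy as np

def yoco(image, augment, rng, axis=1):
    """Minimal YOCO sketch: cut the image into two pieces along `axis`,
    augment each piece independently, then glue the pieces back together."""
    mid = image.shape[axis] // 2
    left, right = np.split(image, [mid], axis=axis)
    return np.concatenate([augment(left, rng), augment(right, rng)], axis=axis)

def random_hflip(piece, rng):
    # stand-in augmentation: horizontal flip with probability 0.5
    return piece[:, ::-1] if rng.random() < 0.5 else piece

rng = np.random.default_rng(0)
img = np.arange(8 * 8 * 3).reshape(8, 8, 3)
out = yoco(img, random_hflip, rng, axis=1)
print(out.shape)  # (8, 8, 3) -- shape unchanged, halves augmented separately
```

Because each half draws its own randomness, a single image can end up with one half flipped and the other not, which is the extra per-sample diversity the summary refers to.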
arXiv Detail & Related papers (2022-01-28T12:34:40Z)
- To be Critical: Self-Calibrated Weakly Supervised Learning for Salient Object Detection [95.21700830273221]
Weakly-supervised salient object detection (WSOD) aims to develop saliency models using image-level annotations.
We propose a self-calibrated training strategy by explicitly establishing a mutual calibration loop between pseudo labels and network predictions.
We prove that even a much smaller dataset with well-matched annotations can facilitate models to achieve better performance as well as generalizability.
arXiv Detail & Related papers (2021-09-04T02:45:22Z)
- Learning Representational Invariances for Data-Efficient Action Recognition [52.23716087656834]
We show that our data augmentation strategy leads to promising performance on the Kinetics-100, UCF-101, and HMDB-51 datasets.
We also validate our data augmentation strategy in the fully supervised setting and demonstrate improved performance.
arXiv Detail & Related papers (2021-03-30T17:59:49Z)
- Ultra-Data-Efficient GAN Training: Drawing A Lottery Ticket First, Then Training It Toughly [114.81028176850404]
Training generative adversarial networks (GANs) with limited data generally results in deteriorated performance and collapsed models.
We decompose the data-hungry GAN training into two sequential sub-problems.
Such a coordinated framework enables us to focus on lower-complexity and more data-efficient sub-problems.
arXiv Detail & Related papers (2021-02-28T05:20:29Z)
- Dataset Condensation with Differentiable Siamese Augmentation [30.571335208276246]
We focus on condensing large training sets into significantly smaller synthetic sets which can be used to train deep neural networks.
We propose Differentiable Siamese Augmentation that enables effective use of data augmentation to synthesize more informative synthetic images.
We show that with less than 1% of the data, our method achieves 99.6%, 94.9%, 88.5%, and 71.5% relative performance on MNIST, FashionMNIST, SVHN, and CIFAR10, respectively.
arXiv Detail & Related papers (2021-02-16T16:32:21Z)
- Multiclass non-Adversarial Image Synthesis, with Application to Classification from Very Small Sample [6.243995448840211]
We present a novel non-adversarial generative method - Clustered Optimization of LAtent space (COLA)
In the full data regime, our method is capable of generating diverse multi-class images with no supervision.
In the small-data regime, where only a small sample of labeled images is available for training with no access to additional unlabeled data, our results surpass state-of-the-art GAN models trained on the same amount of data.
arXiv Detail & Related papers (2020-11-25T18:47:27Z)
- Differentiable Augmentation for Data-Efficient GAN Training [48.920992130257595]
We propose DiffAugment, a simple method that improves the data efficiency of GANs by imposing various types of differentiable augmentations on both real and fake samples.
Our method can generate high-fidelity images using only 100 images without pre-training, while being on par with existing transfer learning algorithms.
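The central DiffAugment idea, applying the same augmentation family to both real and generated samples before the discriminator, can be sketched as follows. This is a loose stand-in, not the library's API: the brightness-jitter transform and both function names are invented for illustration, and plain numpy omits the differentiability that lets generator gradients flow through the augmentation in the real method.

```python
import numpy as np

def brightness_jitter(batch, rng):
    """Stand-in for one DiffAugment-style transform: a random per-image
    brightness shift, clipped to the valid [0, 1] range. The real method
    makes this differentiable; numpy here only illustrates the data flow."""
    shift = rng.uniform(-0.2, 0.2, size=(batch.shape[0], 1, 1, 1))
    return np.clip(batch + shift, 0.0, 1.0)

def discriminator_batch(real, fake, rng):
    # the key idea: the SAME augmentation family is applied to both real
    # and generated samples before either reaches the discriminator
    return brightness_jitter(real, rng), brightness_jitter(fake, rng)

rng = np.random.default_rng(0)
real = rng.random((2, 4, 4, 3))   # stand-in real image batch (NHWC)
fake = rng.random((2, 4, 4, 3))   # stand-in generator output
aug_real, aug_fake = discriminator_batch(real, fake, rng)
print(aug_real.shape, aug_fake.shape)
```

Augmenting both streams identically keeps the discriminator from using augmentation artifacts to tell real from fake, which is what makes the approach data-efficient.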
arXiv Detail & Related papers (2020-06-18T17:59:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.