Exploring the Equivalence of Closed-Set Generative and Real Data Augmentation in Image Classification
- URL: http://arxiv.org/abs/2508.09550v1
- Date: Wed, 13 Aug 2025 07:14:29 GMT
- Title: Exploring the Equivalence of Closed-Set Generative and Real Data Augmentation in Image Classification
- Authors: Haowen Wang, Guowei Zhang, Xiang Zhang, Zeyuan Chen, Haiyang Xu, Dou Hoon Kwark, Zhuowen Tu,
- Abstract summary: Given a training set for an image classification task, can we train a generative model on this dataset to enhance the classification performance?<n>We explore the distinctions and similarities between real images and closed-set synthetic images generated by advanced generative models.<n>We empirically determine the equivalent scale of synthetic images needed for augmentation.
- Score: 39.77627656310901
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we address a key scientific problem in machine learning: Given a training set for an image classification task, can we train a generative model on this dataset to enhance the classification performance? (i.e., closed-set generative data augmentation). We start by exploring the distinctions and similarities between real images and closed-set synthetic images generated by advanced generative models. Through extensive experiments, we offer systematic insights into the effective use of closed-set synthetic data for augmentation. Notably, we empirically determine the equivalent scale of synthetic images needed for augmentation. In addition, we also show quantitative equivalence between the real data augmentation and open-set generative augmentation (generative models trained using data beyond the given training set). While it aligns with the common intuition that real images are generally preferred, our empirical formulation also offers a guideline to quantify the increased scale of synthetic data augmentation required to achieve comparable image classification performance. Our results on natural and medical image datasets further illustrate how this effect varies with the baseline training set size and the amount of synthetic data incorporated.
Related papers
- Synthetic ECG Generation for Data Augmentation and Transfer Learning in Arrhythmia Classification [1.7614607439356635]
We explore the usefulness of synthetic data generated with different generative models from Deep Learning.<n>We investigate the effects of transfer learning, by fine-tuning a synthetically pre-trained model and then adding increasing proportions of real data.
arXiv Detail & Related papers (2024-11-27T15:46:34Z) - Unleashing the Potential of Synthetic Images: A Study on Histopathology Image Classification [0.12499537119440242]
Histopathology image classification is crucial for the accurate identification and diagnosis of various diseases.
We show that synthetic images can effectively augment existing datasets, ultimately improving the performance of the downstream histopathology image classification task.
arXiv Detail & Related papers (2024-09-24T12:02:55Z) - DataDream: Few-shot Guided Dataset Generation [90.09164461462365]
We propose a framework for synthesizing classification datasets that more faithfully represents the real data distribution.
DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model.
We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets.
arXiv Detail & Related papers (2024-07-15T17:10:31Z) - Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization [62.157627519792946]
We introduce a novel framework called bridged transfer, which initially employs synthetic images for fine-tuning a pre-trained model to improve its transferability.
We propose dataset style inversion strategy to improve the stylistic alignment between synthetic and real images.
Our proposed methods are evaluated across 10 different datasets and 5 distinct models, demonstrating consistent improvements.
arXiv Detail & Related papers (2024-03-28T22:25:05Z) - Deep Domain Adaptation: A Sim2Real Neural Approach for Improving Eye-Tracking Systems [80.62854148838359]
Eye image segmentation is a critical step in eye tracking that has great influence over the final gaze estimate.
We use dimensionality-reduction techniques to measure the overlap between the target eye images and synthetic training data.
Our methods result in robust, improved performance when tackling the discrepancy between simulation and real-world data samples.
arXiv Detail & Related papers (2024-03-23T22:32:06Z) - Scaling Laws of Synthetic Images for Model Training ... for Now [54.43596959598466]
We study the scaling laws of synthetic images generated by state of the art text-to-image models.
We observe that synthetic images demonstrate a scaling trend similar to, but slightly less effective than, real images in CLIP training.
arXiv Detail & Related papers (2023-12-07T18:59:59Z) - Improving the Effectiveness of Deep Generative Data [5.856292656853396]
Training a model on purely synthetic images for downstream image processing tasks results in an undesired performance drop compared to training on real data.
We propose a new taxonomy to describe factors contributing to this commonly observed phenomenon and investigate it on the popular CIFAR-10 dataset.
Our method outperforms baselines on downstream classification tasks both in case of training on synthetic only (Synthetic-to-Real) and training on a mix of real and synthetic data.
arXiv Detail & Related papers (2023-11-07T12:57:58Z) - Unified Framework for Histopathology Image Augmentation and Classification via Generative Models [6.404713841079193]
We propose an innovative unified framework that integrates the data generation and model training stages into a unified process.
Our approach utilizes a pure Vision Transformer (ViT)-based conditional Generative Adversarial Network (cGAN) model to simultaneously handle both image synthesis and classification.
Our experiments show that our unified synthetic augmentation framework consistently enhances the performance of histopathology image classification models.
arXiv Detail & Related papers (2022-12-20T03:40:44Z) - Is synthetic data from generative models ready for image recognition? [69.42645602062024]
We study whether and how synthetic images generated from state-of-the-art text-to-image generation models can be used for image recognition tasks.
We showcase the powerfulness and shortcomings of synthetic data from existing generative models, and propose strategies for better applying synthetic data for recognition tasks.
arXiv Detail & Related papers (2022-10-14T06:54:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.