Related papers: Your Image Generator Is Your New Private Dataset

URL: http://arxiv.org/abs/2504.04582v2
Date: Tue, 08 Apr 2025 08:35:53 GMT
Title: Your Image Generator Is Your New Private Dataset
Authors: Nicolo Resmini, Eugenio Lomurno, Cristian Sbrolli, Matteo Matteucci
Abstract summary: Generative diffusion models have emerged as powerful tools to synthetically produce training data. This paper proposes the Text-Conditioned Knowledge Recycling pipeline to tackle these challenges. The pipeline is rigorously evaluated on ten diverse image classification benchmarks.
Score: 4.09225917049674
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Generative diffusion models have emerged as powerful tools to synthetically produce training data, offering potential solutions to data scarcity and reducing labelling costs for downstream supervised deep learning applications. However, effectively leveraging text-conditioned image generation for building classifier training sets requires addressing key issues: constructing informative textual prompts, adapting generative models to specific domains, and ensuring robust performance. This paper proposes the Text-Conditioned Knowledge Recycling (TCKR) pipeline to tackle these challenges. TCKR combines dynamic image captioning, parameter-efficient diffusion model fine-tuning, and Generative Knowledge Distillation techniques to create synthetic datasets tailored for image classification. The pipeline is rigorously evaluated on ten diverse image classification benchmarks. The results demonstrate that models trained solely on TCKR-generated data achieve classification accuracies on par with (and in several cases exceeding) models trained on real images. Furthermore, the evaluation reveals that these synthetic-data-trained models exhibit substantially enhanced privacy characteristics: their vulnerability to Membership Inference Attacks is significantly reduced, with the membership inference AUC lowered by 5.49 points on average compared to using real training data, demonstrating a substantial improvement in the performance-privacy trade-off. These findings indicate that high-fidelity synthetic data can effectively replace real data for training classifiers, yielding strong performance whilst simultaneously providing improved privacy protection as a valuable emergent property. The code and trained models are available in the accompanying open-source repository.
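The privacy claim in the abstract is stated as a reduction in membership inference AUC. As a rough illustration of what that metric measures, the sketch below computes the AUC of a simple confidence-based attack: the probability that a randomly chosen training member receives a higher attack score than a randomly chosen non-member. The function name `membership_auc`, the choice of model confidence as the attack score, and all numbers are assumptions made for illustration; they are not the attack or the results used in the paper.

```python
from typing import Sequence


def membership_auc(member_scores: Sequence[float],
                   nonmember_scores: Sequence[float]) -> float:
    """AUC of an attack that ranks samples by score.

    Equals the fraction of (member, non-member) pairs in which the member
    has the higher score; ties count as half a win. A value of 0.5 means the
    attack cannot tell members from non-members; 1.0 means perfect separation.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


if __name__ == "__main__":
    # Toy, made-up confidences (NOT results from the paper).
    # Classifier trained on real images: members look noticeably more confident.
    real_members = [0.99, 0.97, 0.95, 0.90]
    real_nonmembers = [0.96, 0.85, 0.80, 0.70]
    # Classifier trained on synthetic images: the real "members" were never
    # seen directly, so their confidences overlap more with non-members.
    synth_members = [0.90, 0.80, 0.85, 0.70]
    synth_nonmembers = [0.88, 0.82, 0.75, 0.72]

    auc_real = membership_auc(real_members, real_nonmembers)
    auc_synth = membership_auc(synth_members, synth_nonmembers)
    print(f"AUC (real-data model):      {auc_real:.4f}")
    print(f"AUC (synthetic-data model): {auc_synth:.4f}")
    print(f"Reduction in AUC points:    {100 * (auc_real - auc_synth):.2f}")
```

An AUC closer to 0.5 indicates that the attacker gains little from the model's outputs, which is the sense in which a lower membership inference AUC corresponds to better privacy.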

Related papers

Feature-to-Image Data Augmentation: Improving Model Feature Extraction with Cluster-Guided Synthetic Samples [4.041834517339835]
This study introduces FICAug, a novel feature-to-image data augmentation framework. It is designed to improve model generalization under limited data conditions by generating structured synthetic samples. Experimental results demonstrate that FICAug significantly improves classification accuracy.
arXiv Detail & Related papers (2024-09-26T09:51:08Z)
Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adversarial robustness has been conventionally believed as a challenging property to encode for neural networks. We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z)
Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks [5.0243930429558885]
This paper introduces Knowledge Recycling (KR), a pipeline designed to optimise the generation and use of synthetic data for training downstream classifiers. At the heart of this pipeline is Generative Knowledge Distillation (GKD), the proposed technique that significantly improves the quality and usefulness of the information. The results show a significant reduction in the performance gap between models trained on real and synthetic data, with models based on synthetic data outperforming those trained on real data in some cases.
arXiv Detail & Related papers (2024-07-22T10:31:07Z)
DataDream: Few-shot Guided Dataset Generation [90.09164461462365]
We propose a framework for synthesizing classification datasets that more faithfully represents the real data distribution. DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model. We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets.
arXiv Detail & Related papers (2024-07-15T17:10:31Z)
HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts [49.21764163995419]
We introduce HYPerbolic Entailment filtering (HYPE) to extract meaningful and well-aligned data from noisy image-text pair datasets. HYPE not only demonstrates a significant improvement in filtering efficiency but also sets a new state-of-the-art in the DataComp benchmark. This breakthrough showcases the potential of HYPE to refine the data selection process, thereby contributing to the development of more accurate and efficient self-supervised learning models.
arXiv Detail & Related papers (2024-04-26T16:19:55Z)
Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization [62.157627519792946]
We introduce a novel framework called bridged transfer, which initially employs synthetic images for fine-tuning a pre-trained model to improve its transferability. We propose a dataset style inversion strategy to improve the stylistic alignment between synthetic and real images. Our proposed methods are evaluated across 10 different datasets and 5 distinct models, demonstrating consistent improvements.
arXiv Detail & Related papers (2024-03-28T22:25:05Z)
DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception [78.26734070960886]
Current perceptive models heavily depend on resource-intensive datasets. We introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability. Our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation.
arXiv Detail & Related papers (2024-03-20T04:58:03Z)
Improving the Effectiveness of Deep Generative Data [5.856292656853396]
Training a model on purely synthetic images for downstream image processing tasks results in an undesired performance drop compared to training on real data. We propose a new taxonomy to describe factors contributing to this commonly observed phenomenon and investigate it on the popular CIFAR-10 dataset. Our method outperforms baselines on downstream classification tasks both in case of training on synthetic only (Synthetic-to-Real) and training on a mix of real and synthetic data.
arXiv Detail & Related papers (2023-11-07T12:57:58Z)
From Zero to Hero: Detecting Leaked Data through Synthetic Data Injection and Model Querying [10.919336198760808]
We introduce a novel methodology to detect leaked data that are used to train classification models. LDSS involves injecting a small volume of synthetic data, characterized by local shifts in class distribution, into the owner's dataset. This enables the effective identification of models trained on leaked data through model querying alone.
arXiv Detail & Related papers (2023-10-06T10:36:28Z)
Leaving Reality to Imagination: Robust Classification via Generated Datasets [24.411444438920988]
Recent research on robustness has revealed significant performance gaps between neural image classifiers trained on datasets similar to the test set and those trained on datasets from a naturally shifted distribution. We study the question: How do generated datasets influence the natural robustness of image classifiers? We find that ImageNet classifiers trained on real data augmented with generated data achieve higher accuracy and effective robustness than standard training.
arXiv Detail & Related papers (2023-02-05T22:49:33Z)
Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution. We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator. Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
arXiv Detail & Related papers (2021-02-09T20:28:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.