Will Large-scale Generative Models Corrupt Future Datasets?
- URL: http://arxiv.org/abs/2211.08095v2
- Date: Thu, 10 Aug 2023 00:22:27 GMT
- Title: Will Large-scale Generative Models Corrupt Future Datasets?
- Authors: Ryuichiro Hataya and Han Bao and Hiromi Arai
- Abstract summary: Large-scale text-to-image generative models can generate high-quality and realistic images from users' prompts.
This paper empirically examines whether such generated images will affect future datasets and computer vision models by simulating contamination.
We conclude that generated images negatively affect downstream performance, while the significance depends on tasks and the amount of generated images.
- Score: 5.593352892211305
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently proposed large-scale text-to-image generative models such as
DALL·E 2, Midjourney, and StableDiffusion can generate high-quality and
realistic images from users' prompts. Not limited to the research community,
ordinary Internet users enjoy these generative models, and consequently, a
tremendous amount of generated images have been shared on the Internet.
Meanwhile, today's success of deep learning in the computer vision field owes a
lot to images collected from the Internet. These trends lead us to a research
question: "will such generated images impact the quality of future
datasets and the performance of computer vision models positively or
negatively?" This paper empirically answers this question by simulating
contamination. Namely, we generate ImageNet-scale and COCO-scale datasets using
a state-of-the-art generative model and evaluate models trained with
"contaminated" datasets on various tasks, including image classification and
image generation. From these experiments, we conclude that generated images
negatively affect downstream performance, while the significance depends on
tasks and the amount of generated images. The generated datasets and source
code for the experiments are publicly available for future research at
https://github.com/moskomule/dataset-contamination.
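The abstract describes simulating contamination by mixing generated images into real datasets, but this page gives no procedural detail. The following is only a minimal sketch of one way such a mix could be built, assuming contamination means replacing a fixed fraction of real samples with generated ones. The function name `contaminate`, its parameters, and the file names are hypothetical and are not taken from the authors' released code.

```python
import random
from typing import List, Sequence, Tuple


def contaminate(
    real: Sequence[str],
    generated: Sequence[str],
    ratio: float,
    seed: int = 0,
) -> List[Tuple[str, bool]]:
    """Replace a fraction `ratio` of real samples with generated ones.

    Returns (sample, is_generated) pairs in the original order of `real`.
    Assumption: contamination is modelled as replacement, so the dataset
    size stays fixed while the share of generated samples grows.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    n_swap = int(len(real) * ratio)
    if n_swap > len(generated):
        raise ValueError("not enough generated samples for this ratio")

    rng = random.Random(seed)
    # Choose which positions of the real dataset are replaced.
    swap_idx = set(rng.sample(range(len(real)), n_swap))
    # Choose which generated samples are used as replacements.
    replacements = iter(rng.sample(list(generated), n_swap))

    mixed: List[Tuple[str, bool]] = []
    for i, item in enumerate(real):
        if i in swap_idx:
            mixed.append((next(replacements), True))
        else:
            mixed.append((item, False))
    return mixed


if __name__ == "__main__":
    real_paths = [f"real_{i}.jpg" for i in range(10)]
    generated_paths = [f"gen_{i}.png" for i in range(10)]
    for path, is_generated in contaminate(real_paths, generated_paths, 0.4):
        print(path, "generated" if is_generated else "real")
```

Sweeping `ratio` over several values and training one model per mix would give a curve of downstream performance against the amount of generated data, which is the kind of dependence the abstract reports.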
Related papers
- Community Forensics: Using Thousands of Generators to Train Fake Image Detectors [15.166026536032142]
One of the key challenges of detecting AI-generated images is spotting images that have been created by previously unseen generative models.
We propose a new dataset that is significantly larger and more diverse than prior work.
The resulting dataset contains 2.7M images that have been sampled from 4803 different models.
arXiv Detail & Related papers (2024-11-06T18:59:41Z)
- DataDream: Few-shot Guided Dataset Generation [90.09164461462365]
We propose a framework for synthesizing classification datasets that more faithfully represents the real data distribution.
DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model.
We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets.
arXiv Detail & Related papers (2024-07-15T17:10:31Z)
- Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation [58.09421301921607]
We construct the first large-scale dataset for subject-driven image editing and generation.
Our dataset is 5 times the size of the previous largest dataset, yet our cost is tens of thousands of GPU hours lower.
arXiv Detail & Related papers (2024-06-13T16:40:39Z)
- How to Trace Latent Generative Model Generated Images without Artificial Watermark? [88.04880564539836]
Concerns have arisen regarding potential misuse related to images generated by latent generative models.
We propose a latent inversion based method called LatentTracer to trace the generated images of the inspected model.
Our experiments show that our method can distinguish the images generated by the inspected model and other images with a high accuracy and efficiency.
arXiv Detail & Related papers (2024-05-22T05:33:47Z)
- Would Deep Generative Models Amplify Bias in Future Models? [29.918422914275226]
We investigate the impact of deep generative models on potential social biases in upcoming computer vision models.
We conduct simulations by substituting original images in COCO and CC3M datasets with images generated through Stable Diffusion.
Contrary to expectations, our findings indicate that introducing generated images during training does not uniformly amplify bias.
arXiv Detail & Related papers (2024-04-04T06:58:39Z)
- Improving the Effectiveness of Deep Generative Data [5.856292656853396]
Training a model on purely synthetic images for downstream image processing tasks results in an undesired performance drop compared to training on real data.
We propose a new taxonomy to describe factors contributing to this commonly observed phenomenon and investigate it on the popular CIFAR-10 dataset.
Our method outperforms baselines on downstream classification tasks both in case of training on synthetic only (Synthetic-to-Real) and training on a mix of real and synthetic data.
arXiv Detail & Related papers (2023-11-07T12:57:58Z)
- Fake it till you make it: Learning transferable representations from synthetic ImageNet clones [30.264601433216246]
We show that ImageNet clones can close a large part of the gap between models produced by synthetic images and models trained with real images.
We show that models trained on synthetic images exhibit strong generalization properties and perform on par with models trained on real data for transfer.
arXiv Detail & Related papers (2022-12-16T11:44:01Z)
- Is synthetic data from generative models ready for image recognition? [69.42645602062024]
We study whether and how synthetic images generated from state-of-the-art text-to-image generation models can be used for image recognition tasks.
We showcase the powerfulness and shortcomings of synthetic data from existing generative models, and propose strategies for better applying synthetic data for recognition tasks.
arXiv Detail & Related papers (2022-10-14T06:54:24Z)
- InvGAN: Invertible GANs [88.58338626299837]
InvGAN, short for Invertible GAN, successfully embeds real images to the latent space of a high quality generative model.
This allows us to perform image inpainting, merging, and online data augmentation.
arXiv Detail & Related papers (2021-12-08T21:39:00Z)
- From ImageNet to Image Classification: Contextualizing Progress on Benchmarks [99.19183528305598]
We study how specific design choices in the ImageNet creation process impact the fidelity of the resulting dataset.
Our analysis pinpoints how a noisy data collection pipeline can lead to a systematic misalignment between the resulting benchmark and the real-world task it serves as a proxy for.
arXiv Detail & Related papers (2020-05-22T17:39:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.