How far can we go with ImageNet for Text-to-Image generation?
- URL: http://arxiv.org/abs/2502.21318v3
- Date: Thu, 02 Oct 2025 13:14:12 GMT
- Title: How far can we go with ImageNet for Text-to-Image generation?
- Authors: L. Degeorge, A. Ghosh, N. Dufour, D. Picard, V. Kalogeiton
- Abstract summary: We show that one can achieve capabilities of models trained on massive web-scraped collections using only ImageNet enhanced with well-designed text and image augmentations. With this much simpler setup, we achieve a +6% overall score over SD-XL on GenEval and +5% on DPGBench while using just 1/10th the parameters and 1/1000th the training images.
- Score: 0.5437050212139086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent text-to-image (T2I) generation models have achieved remarkable success by training on billion-scale datasets, following a 'bigger is better' paradigm that prioritizes data quantity over availability (closed vs. open source) and reproducibility (data decay vs. established collections). We challenge this established paradigm by demonstrating that one can match the capabilities of models trained on massive web-scraped collections using only ImageNet enhanced with well-designed text and image augmentations. With this much simpler setup, we achieve a +6% overall score over SD-XL on GenEval and +5% on DPGBench while using just 1/10th the parameters and 1/1000th the training images. We also show that ImageNet-pretrained models can be fine-tuned on task-specific datasets (e.g., for high-resolution aesthetic applications) with good results, indicating that ImageNet is sufficient for acquiring general capabilities. This opens the way for more reproducible research, as ImageNet is widely available and the proposed standardized training setup requires only 500 H100 GPU hours to train a text-to-image model.
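The abstract credits "well-designed text and image augmentations" without spelling them out. As a purely hypothetical illustration (the function names, templates, and crop sizes below are ours, not the paper's), a minimal numpy sketch of the kind of paired image/caption augmentation such a setup involves:

```python
import numpy as np

# Hypothetical caption templates for text augmentation of a class label.
TEMPLATES = ["a photo of a {}", "a cropped photo of a {}", "a close-up of a {}"]

def random_caption(label: str, rng: np.random.Generator) -> str:
    """Pick a random prompt template for the class label (text augmentation)."""
    return str(rng.choice(TEMPLATES)).format(label)

def random_flip(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Mirror the image horizontally with probability 0.5 (H, W, C layout)."""
    return img[:, ::-1, :] if rng.random() < 0.5 else img

def random_crop(img: np.ndarray, size: int, rng: np.random.Generator) -> np.ndarray:
    """Take a random size x size crop from the image."""
    h, w, _ = img.shape
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size, :]

rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))          # stand-in for an ImageNet image
aug = random_crop(random_flip(img, rng), 224, rng)
cap = random_caption("goldfish", rng)
print(aug.shape, cap)
```

In practice a training pipeline would use a library such as torchvision for the image side; the point is only that each ImageNet sample yields many distinct (image, caption) pairs.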
Related papers
- ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning [89.19449553099747]
We study the problem of Text-to-Image In-Context Learning (T2I-ICL)
We propose a framework that incorporates a thought process called ImageGen-CoT prior to image generation.
We fine-tune MLLMs using this dataset to enhance their contextual reasoning capabilities.
arXiv Detail & Related papers (2025-03-25T03:18:46Z) - DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks [51.439283251703635]
We create a good, generalist perception model that can tackle multiple tasks, within limits on computational resources and training data. Our exhaustive evaluation metrics demonstrate that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models. We show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation.
arXiv Detail & Related papers (2025-02-24T13:51:06Z) - SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training [77.681908636429]
Text-to-image (T2I) models face several limitations, including large model sizes and slow, low-quality generation on mobile devices.
This paper aims to develop an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms.
arXiv Detail & Related papers (2024-12-12T18:59:53Z) - CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation [69.43106794519193]
We propose the CtrLoRA framework, which trains a Base ControlNet to learn the common knowledge of image-to-image generation from multiple base conditions. Our framework reduces the learnable parameters by 90% compared to ControlNet, significantly lowering the threshold to distribute and deploy the model weights.
arXiv Detail & Related papers (2024-10-12T07:04:32Z) - Data Extrapolation for Text-to-image Generation on Small Datasets [3.7356387436951146]
We propose a new data augmentation method for text-to-image generation using linear extrapolation.
We construct training samples dozens of times larger than the original dataset.
Our model achieves FID scores of 7.91, 9.52 and 5.00 on the CUB, Oxford and COCO datasets.
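The "linear extrapolation" above can be sketched in a few lines. A minimal numpy illustration, under the assumption that new training features are generated by pushing a sample's vector away from a neighboring vector (the function name and the λ value are hypothetical, not the paper's):

```python
import numpy as np

def linear_extrapolate(anchor: np.ndarray, neighbor: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Generate a new feature by extrapolating past `anchor` along the
    direction pointing away from `neighbor`:
        x_new = anchor + lam * (anchor - neighbor)
    """
    anchor = np.asarray(anchor, dtype=float)
    neighbor = np.asarray(neighbor, dtype=float)
    return anchor + lam * (anchor - neighbor)

a = np.array([1.0, 2.0])
b = np.array([0.0, 1.0])
print(linear_extrapolate(a, b, lam=0.5))  # [1.5 2.5]
```

Applied to text or feature embeddings of existing captions, this yields samples outside the convex hull of the original dataset, which is how an augmented set "dozens of times larger" can be constructed from a small one.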
arXiv Detail & Related papers (2024-10-02T15:08:47Z) - Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation [58.09421301921607]
We construct the first large-scale dataset for subject-driven image editing and generation.
Our dataset is 5 times the size of the previous largest dataset, yet our cost is tens of thousands of GPU hours lower.
arXiv Detail & Related papers (2024-06-13T16:40:39Z) - xT: Nested Tokenization for Larger Context in Large Images [79.37673340393475]
xT is a framework for vision transformers which aggregates global context with local details.
We are able to increase accuracy by up to 8.6% on challenging classification tasks.
arXiv Detail & Related papers (2024-03-04T10:29:58Z) - Large-scale Dataset Pruning with Dynamic Uncertainty [28.60845105174658]
The state of the art of many learning tasks, e.g., image classification, is advanced by collecting larger datasets and then training larger models on them.
In this paper, we investigate how to prune the large-scale datasets, and thus produce an informative subset for training sophisticated deep models with negligible performance drop.
arXiv Detail & Related papers (2023-06-08T13:14:35Z) - HADA: A Graph-based Amalgamation Framework in Image-text Retrieval [2.3013879633693266]
We propose a compact graph-based framework, named HADA, which can combine pretrained models to produce a better result.
Our experiments showed that HADA could increase baseline performance by more than 3.6% in terms of evaluation metrics in the Flickr30k dataset.
arXiv Detail & Related papers (2023-01-11T22:25:20Z) - Fake it till you make it: Learning transferable representations from synthetic ImageNet clones [30.264601433216246]
We show that ImageNet clones can close a large part of the gap between models trained on synthetic images and models trained on real images.
We show that models trained on synthetic images exhibit strong generalization properties and perform on par with models trained on real data for transfer.
arXiv Detail & Related papers (2022-12-16T11:44:01Z) - Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks.
We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
arXiv Detail & Related papers (2022-12-04T07:26:24Z) - Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory [66.035487142452]
We show that trajectory-matching-based methods (MTT) can scale to large-scale datasets such as ImageNet-1K.
We propose a procedure to exactly compute the unrolled gradient with constant memory complexity, which allows us to scale MTT to ImageNet-1K seamlessly with 6x reduction in memory footprint.
The resulting algorithm sets a new SOTA on ImageNet-1K: we can scale up to 50 IPC (Images Per Class) on ImageNet-1K on a single GPU.
arXiv Detail & Related papers (2022-11-19T04:46:03Z) - BigDatasetGAN: Synthesizing ImageNet with Pixel-wise Annotations [89.42397034542189]
We synthesize a large labeled dataset via a generative adversarial network (GAN).
We take image samples from the class-conditional generative model BigGAN trained on ImageNet, and manually annotate 5 images per class, for all 1k classes.
We create a new ImageNet benchmark by labeling an additional set of 8k real images and evaluate segmentation performance in a variety of settings.
arXiv Detail & Related papers (2022-01-12T20:28:34Z) - LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first work to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z) - ImageNet-21K Pretraining for the Masses [12.339884639594624]
ImageNet-1K serves as the primary dataset for pretraining deep learning models for computer vision tasks.
The ImageNet-21K dataset contains more pictures and classes.
This paper aims to make high-quality efficient pretraining on ImageNet-21K available for everyone.
arXiv Detail & Related papers (2021-04-22T10:10:14Z) - Learning Transferable Visual Models From Natural Language Supervision [13.866297967166089]
Learning directly from raw text about images is a promising alternative.
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn.
SOTA image representations are learned from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
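The pre-training task described above, predicting which caption goes with which image, is commonly implemented as a symmetric cross-entropy over a batch's image-text similarity matrix. A minimal numpy sketch of that objective (an illustration of the general contrastive technique, not the paper's actual implementation):

```python
import numpy as np

def contrastive_loss(img_emb: np.ndarray, txt_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over a batch of paired embeddings.
    Row i of each matrix embeds pair i; matching pairs (the diagonal)
    should score higher than all mismatched combinations."""
    # L2-normalize, then compute the (batch x batch) cosine-similarity logits.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def xent_diagonal(l: np.ndarray) -> float:
        """Cross-entropy with the diagonal (correct pairings) as targets."""
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -float(np.mean(np.diag(log_probs)))

    # Average the image->text and text->image directions.
    return 0.5 * (xent_diagonal(logits) + xent_diagonal(logits.T))

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(contrastive_loss(img, txt))
```

The loss is non-negative by construction and approaches zero only when every image is matched to its own caption with probability near 1, which is what makes the task "efficient and scalable": the batch itself supplies the negatives.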
arXiv Detail & Related papers (2021-02-26T19:04:58Z) - Shape-Texture Debiased Neural Network Training [50.6178024087048]
Convolutional Neural Networks are often biased towards either texture or shape, depending on the training dataset.
We develop an algorithm for shape-texture debiased learning.
Experiments show that our method successfully improves model performance on several image recognition benchmarks.
arXiv Detail & Related papers (2020-10-12T19:16:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information (including all listed summaries) and is not responsible for any consequences of its use.