How far can we go with ImageNet for Text-to-Image generation?
- URL: http://arxiv.org/abs/2502.21318v2
- Date: Wed, 21 May 2025 15:26:34 GMT
- Title: How far can we go with ImageNet for Text-to-Image generation?
- Authors: L. Degeorge, A. Ghosh, N. Dufour, D. Picard, V. Kalogeiton
- Abstract summary: We show that one can match or outperform models trained on massive web-scraped collections using only ImageNet enhanced with well-designed text and image augmentations. We achieve a +1% overall score over SD-XL on GenEval and +0.5% on DPGBench while using just 1/10th the parameters and 1/1000th the training images.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent text-to-image generation models have achieved remarkable results by training on billion-scale datasets, following a 'bigger is better' paradigm that prioritizes data quantity over availability (closed vs open source) and reproducibility (data decay vs established collections). We challenge this established paradigm by demonstrating that one can match or outperform models trained on massive web-scraped collections, using only ImageNet enhanced with well-designed text and image augmentations. With this much simpler setup, we achieve a +1% overall score over SD-XL on GenEval and +0.5% on DPGBench while using just 1/10th the parameters and 1/1000th the training images. This opens the way for more reproducible research as ImageNet is a widely available dataset and our standardized training setup does not require massive compute resources.
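The abstract does not spell out the augmentations, so the following is only a minimal sketch of the general recipe: standard torchvision image augmentations paired with a caption derived from the class label, where `recaption` and the templates are hypothetical stand-ins for whatever text-augmentation model the authors actually use.

```python
import random
import torchvision.transforms as T

# Illustrative image augmentations (not the paper's exact recipe).
IMAGE_AUG = T.Compose([
    T.RandomResizedCrop(256, scale=(0.8, 1.0)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# Caption templates; a real pipeline would likely use a captioning model instead.
TEMPLATES = [
    "a photo of a {cls}",
    "a close-up photograph of a {cls}",
    "an image showing a {cls} in its typical surroundings",
]

def recaption(cls_name: str) -> str:
    """Hypothetical stand-in for text augmentation: vary the caption per sample."""
    return random.choice(TEMPLATES).format(cls=cls_name)

def augmented_pair(image, cls_name: str):
    """One (image, caption) training pair built from an ImageNet (image, label)."""
    return IMAGE_AUG(image), recaption(cls_name)
```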
Related papers
- ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning [89.19449553099747]
We study the problem of Text-to-Image In-Context Learning (T2I-ICL).
We propose a framework that incorporates a thought process called ImageGen-CoT prior to image generation.
We fine-tune MLLMs using this dataset to enhance their contextual reasoning capabilities.
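A minimal sketch of the two-stage idea described above, with `mllm_reason` and `generate_image` as hypothetical stand-ins for the paper's MLLM and image generator:

```python
def mllm_reason(context_examples: list[str], prompt: str) -> str:
    """Hypothetical stand-in for the MLLM: emit a reasoning trace before generating."""
    return (f"Given {len(context_examples)} in-context examples, the target image "
            f"should depict {prompt!r} in the same style as the examples.")

def generate_image(prompt: str, reasoning: str) -> dict:
    """Hypothetical stand-in for the generator, conditioned on prompt plus reasoning."""
    return {"conditioning": f"{reasoning} || {prompt}"}

def imagegen_cot(context_examples: list[str], prompt: str) -> dict:
    thought = mllm_reason(context_examples, prompt)  # stage 1: think first
    return generate_image(prompt, thought)           # stage 2: generate with the thought
```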
arXiv Detail & Related papers (2025-03-25T03:18:46Z)
- DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks [51.439283251703635]
We build a generalist perception model that can tackle multiple tasks within limits on computational resources and training data.
Our exhaustive evaluation demonstrates that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models.
We show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation.
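The random-color strategy can be illustrated with a short sketch (assumed details: boolean instance masks in, uint8 RGB target out):

```python
import numpy as np

def colorize_instances(masks: np.ndarray, seed: int = 0) -> np.ndarray:
    """Turn (N, H, W) boolean instance masks into an (H, W, 3) RGB target image."""
    rng = np.random.default_rng(seed)
    h, w = masks.shape[1:]
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    for mask in masks:
        # One random color per instance, so segmentation targets look like RGB images.
        canvas[mask] = rng.integers(0, 256, size=3, dtype=np.uint8)
    return canvas
```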
arXiv Detail & Related papers (2025-02-24T13:51:06Z)
- SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training [77.681908636429]
Text-to-image (T2I) models face several limitations, including large model sizes and slow, low-quality generation on mobile devices.
This paper aims to develop an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms.
arXiv Detail & Related papers (2024-12-12T18:59:53Z)
- Data Extrapolation for Text-to-image Generation on Small Datasets [3.7356387436951146]
We propose a new data augmentation method for text-to-image generation using linear extrapolation.
We construct training samples dozens of times larger than the original dataset.
Our model achieves FID scores of 7.91, 9.52, and 5.00 on the CUB, Oxford, and COCO datasets, respectively.
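A minimal sketch of linear extrapolation in feature space, assuming each new sample is pushed away from a semantically close neighbor (the paper's exact formulation may differ):

```python
import numpy as np

def extrapolate(x: np.ndarray, neighbor: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """New sample beyond x, along the direction pointing away from the neighbor."""
    return x + lam * (x - neighbor)

x = np.array([0.2, 0.8])   # feature of an existing training sample
nb = np.array([0.1, 0.6])  # feature of a nearby sample
print(extrapolate(x, nb))  # [0.25 0.9] -- lies outside the segment between them
```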
arXiv Detail & Related papers (2024-10-02T15:08:47Z)
- Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation [58.09421301921607]
We construct the first large-scale dataset for subject-driven image editing and generation.
Our dataset is 5 times the size of the previous largest dataset, yet its construction cost is lower by tens of thousands of GPU hours.
arXiv Detail & Related papers (2024-06-13T16:40:39Z)
- xT: Nested Tokenization for Larger Context in Large Images [79.37673340393475]
xT is a framework for vision transformers that aggregates global context with local details.
We are able to increase accuracy by up to 8.6% on challenging classification tasks.
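A rough sketch of nested tokenization with plain arrays; the region and patch sizes are illustrative, and the paper's encoder details are omitted:

```python
import numpy as np

def nested_tokens(img: np.ndarray, region: int = 256, patch: int = 16) -> np.ndarray:
    """(H, W, C) image -> (n_regions, tokens_per_region, patch*patch*C) tokens.
    Assumes H and W are divisible by `region`, and `region` by `patch`."""
    h, w, _ = img.shape
    regions = [img[i:i + region, j:j + region]
               for i in range(0, h, region) for j in range(0, w, region)]
    out = []
    for r in regions:  # local tokenization: each region is patchified independently
        toks = [r[i:i + patch, j:j + patch].reshape(-1)
                for i in range(0, region, patch) for j in range(0, region, patch)]
        out.append(np.stack(toks))
    return np.stack(out)  # a global model would then attend across all region tokens

print(nested_tokens(np.zeros((512, 512, 3))).shape)  # (4, 256, 768)
```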
arXiv Detail & Related papers (2024-03-04T10:29:58Z)
- Large-scale Dataset Pruning with Dynamic Uncertainty [28.60845105174658]
The state of the art of many learning tasks, e.g., image classification, is advanced by collecting larger datasets and then training larger models on them.
In this paper, we investigate how to prune large-scale datasets, producing an informative subset for training sophisticated deep models with a negligible performance drop.
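A simplified sketch of uncertainty-based pruning, assuming the "dynamic uncertainty" of an example is approximated by the standard deviation of its true-class probability across training checkpoints (the paper's exact score may differ):

```python
import numpy as np

def dynamic_uncertainty(true_class_probs: np.ndarray) -> np.ndarray:
    """true_class_probs: (n_checkpoints, n_examples) -> one score per example."""
    return true_class_probs.std(axis=0)  # large fluctuation = still informative

def prune(true_class_probs: np.ndarray, keep_frac: float = 0.75) -> np.ndarray:
    """Indices of the most uncertain examples, kept for training."""
    scores = dynamic_uncertainty(true_class_probs)
    k = int(len(scores) * keep_frac)
    return np.argsort(scores)[::-1][:k]
```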
arXiv Detail & Related papers (2023-06-08T13:14:35Z)
- Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks.
We investigate the reasons behind their zero-shot failure cases and find that many are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
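A small sketch of the WordNet idea using NLTK, appending a hypernym to disambiguate a class name; the prompt template is illustrative:

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def disambiguated_prompts(class_name: str) -> list[str]:
    """One prompt per WordNet noun sense, each qualified by its hypernym."""
    prompts = []
    for synset in wn.synsets(class_name, pos=wn.NOUN):
        hypernyms = synset.hypernyms()
        if hypernyms:
            parent = hypernyms[0].lemma_names()[0].replace("_", " ")
            prompts.append(f"a photo of a {class_name}, a kind of {parent}")
    return prompts

print(disambiguated_prompts("crane"))  # separates the bird sense from the machine sense
```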
arXiv Detail & Related papers (2022-12-04T07:26:24Z)
- Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory [66.035487142452]
We show that trajectory-matching-based methods (MTT) can scale to large-scale datasets such as ImageNet-1K.
We propose a procedure to exactly compute the unrolled gradient with constant memory complexity, which allows us to scale MTT to ImageNet-1K seamlessly with a 6x reduction in memory footprint.
The resulting algorithm sets a new SOTA on ImageNet-1K: we can scale up to 50 IPCs (Images Per Class) on a single GPU.
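The trajectory-matching objective itself is compact; the sketch below shows the normalized parameter-matching loss, while the paper's contribution (computing its unrolled gradient exactly in constant memory) is omitted:

```python
import torch

def mtt_loss(theta_student: torch.Tensor,
             theta_expert_start: torch.Tensor,
             theta_expert_target: torch.Tensor) -> torch.Tensor:
    """Distance of the student's parameters (after a few steps on synthetic data)
    to a later expert checkpoint, normalized by how far the expert itself moved."""
    num = (theta_student - theta_expert_target).pow(2).sum()
    den = (theta_expert_start - theta_expert_target).pow(2).sum()
    return num / den
```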
arXiv Detail & Related papers (2022-11-19T04:46:03Z)
- BigDatasetGAN: Synthesizing ImageNet with Pixel-wise Annotations [89.42397034542189]
We synthesize a large labeled dataset via a generative adversarial network (GAN).
We take image samples from the class-conditional generative model BigGAN trained on ImageNet, and manually annotate 5 images per class, for all 1k classes.
We create a new ImageNet benchmark by labeling an additional set of 8k real images and evaluate segmentation performance in a variety of settings.
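A sketch of the resulting auto-labeling recipe, with `biggan_sample` and `seg_head` as hypothetical stand-ins for BigGAN sampling and the feature-based segmentation head trained on the few manual annotations per class:

```python
import numpy as np

def biggan_sample(class_id: int):
    """Hypothetical stand-in for BigGAN sampling: (image, generator features)."""
    return np.zeros((256, 256, 3)), np.zeros((64, 64, 128))

def auto_label(seg_head, classes, per_class: int = 100) -> list:
    """Label unlimited synthetic samples with a head trained on few manual masks."""
    dataset = []
    for cls in classes:
        for _ in range(per_class):
            image, feats = biggan_sample(cls)
            dataset.append((image, seg_head(feats)))  # predicted mask becomes the label
    return dataset

# Usage with a dummy head: auto_label(lambda feats: feats.argmax(-1), range(10))
```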
arXiv Detail & Related papers (2022-01-12T20:28:34Z)
- ImageNet-21K Pretraining for the Masses [12.339884639594624]
ImageNet-1K serves as the primary dataset for pretraining deep learning models for computer vision tasks.
The ImageNet-21K dataset contains more images and classes.
This paper aims to make high-quality efficient pretraining on ImageNet-21K available for everyone.
arXiv Detail & Related papers (2021-04-22T10:10:14Z)
- Learning Transferable Visual Models From Natural Language Supervision [13.866297967166089]
Learning directly from raw text about images is a promising alternative to supervision from fixed sets of predetermined labels.
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn.
SOTA image representations are learned from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
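The pre-training task reduces to a symmetric contrastive loss over a batch of image and caption embeddings, sketched here in PyTorch (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch: matching (image, text) pairs sit on the diagonal."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0))         # correct caption index per image
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```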
arXiv Detail & Related papers (2021-02-26T19:04:58Z)
- Shape-Texture Debiased Neural Network Training [50.6178024087048]
Convolutional Neural Networks are often biased towards either texture or shape, depending on the training dataset.
We develop an algorithm for shape-texture debiased learning.
Experiments show that our method successfully improves model performance on several image recognition benchmarks.
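A sketch of one way to realize shape-texture debiasing, assuming a style-transfer step that swaps textures between images and a soft label that credits both source classes (`style_transfer` is a placeholder):

```python
import torch

def style_transfer(shape_img: torch.Tensor, texture_img: torch.Tensor) -> torch.Tensor:
    """Placeholder for an AdaIN-style texture swap; returns shape_img unchanged here."""
    return shape_img

def debiased_sample(shape_img, y_shape: int, texture_img, y_texture: int,
                    num_classes: int, w: float = 0.5):
    """Image with conflicting shape/texture cues, plus a soft label crediting both."""
    mixed = style_transfer(shape_img, texture_img)
    target = torch.zeros(num_classes)
    target[y_shape] += w          # credit the shape label...
    target[y_texture] += 1.0 - w  # ...and the texture label
    return mixed, target
```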
arXiv Detail & Related papers (2020-10-12T19:16:12Z)