StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual
Representation Learners
- URL: http://arxiv.org/abs/2306.00984v2
- Date: Thu, 26 Oct 2023 15:16:57 GMT
- Title: StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual
Representation Learners
- Authors: Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, Dilip Krishnan
- Abstract summary: We show that training self-supervised methods on synthetic images can match or beat training on the corresponding real images.
We develop a multi-positive contrastive learning method, which we call StableRep.
With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP.
- Score: 58.941838860425754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the potential of learning visual representations using
synthetic images generated by text-to-image models. This is a natural question
in light of the excellent performance of such models at generating high-quality
images. We specifically consider Stable Diffusion, one of the leading
open-source text-to-image models. We show that (1) when the generative model is
configured with a proper classifier-free guidance scale, training
self-supervised methods on synthetic images can match or beat training on the
corresponding real images; (2) by treating the multiple images generated from
the same text prompt as positives for each other, we develop a multi-positive
contrastive learning method, which we call StableRep. Using solely synthetic
images, the representations learned by StableRep surpass those learned by
SimCLR and CLIP from the same set of text prompts and the corresponding real
images, on large-scale datasets. When we further add language supervision,
StableRep trained with 20M synthetic images achieves better accuracy than CLIP
trained with 50M real images.
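The core idea of the multi-positive objective is simple: every image in a batch is contrasted against all other images, and any image generated from the same text prompt counts as a positive. The sketch below is a minimal PyTorch illustration under our own assumptions, not the authors' released code; the function name, the `prompt_ids` bookkeeping, and the temperature value are hypothetical, and the features are assumed to come from an encoder applied to images sampled from Stable Diffusion at a fixed classifier-free guidance scale.

```python
# Minimal sketch (not the authors' implementation) of a multi-positive
# contrastive loss in the spirit of StableRep: images generated from the
# same text prompt are treated as positives for one another.
import torch
import torch.nn.functional as F


def multi_positive_contrastive_loss(features: torch.Tensor,
                                    prompt_ids: torch.Tensor,
                                    temperature: float = 0.1) -> torch.Tensor:
    """features: (N, D) L2-normalized embeddings; prompt_ids: (N,) prompt index per image."""
    n = features.size(0)
    # Pairwise cosine similarities, scaled by a temperature (value is illustrative).
    logits = features @ features.t() / temperature
    # Exclude each image's similarity to itself.
    self_mask = torch.eye(n, dtype=torch.bool, device=features.device)
    logits = logits.masked_fill(self_mask, -1e9)
    # Positives: other images generated from the same prompt.
    pos_mask = (prompt_ids.unsqueeze(0) == prompt_ids.unsqueeze(1)) & ~self_mask
    # Target distribution: uniform over the positives of each anchor.
    target = pos_mask.float() / pos_mask.sum(dim=1, keepdim=True).clamp(min=1)
    # Cross-entropy between the softmax over all candidates and the target.
    log_prob = F.log_softmax(logits, dim=1)
    return -(target * log_prob).sum(dim=1).mean()


# Toy usage: 4 prompts x 2 synthetic images each (random stand-ins for encoder outputs).
feats = F.normalize(torch.randn(8, 128), dim=1)
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(multi_positive_contrastive_loss(feats, ids))
```

In this formulation, a SimCLR-style loss is recovered as the special case of exactly one positive per anchor (e.g., two augmented views of the same image), while multiple samples per prompt simply spread the target distribution over several positives.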
Related papers
- Learning Vision from Models Rivals Learning Vision from Data [54.43596959598465]
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions.
We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption.
We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs.
arXiv Detail & Related papers (2023-12-28T18:59:55Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- ITI-GEN: Inclusive Text-to-Image Generation [56.72212367905351]
This study investigates inclusive text-to-image generative models that generate images based on human-written prompts.
We show that, for some attributes, images can represent concepts more expressively than text.
We propose a novel approach, ITI-GEN, that leverages readily available reference images for Inclusive Text-to-Image GENeration.
arXiv Detail & Related papers (2023-09-11T15:54:30Z)
- Image Captions are Natural Prompts for Text-to-Image Models [70.30915140413383]
We analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts.
We propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data.
Our method significantly improves the performance of models trained on synthetic training data.
arXiv Detail & Related papers (2023-07-17T14:38:11Z)
- Hierarchical Text-Conditional Image Generation with CLIP Latents [20.476720970770128]
We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style.
arXiv Detail & Related papers (2022-04-13T01:10:33Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- Improving Text-to-Image Synthesis Using Contrastive Learning [4.850820365312369]
We propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images.
We evaluate our approach on two popular text-to-image synthesis models, AttnGAN and DM-GAN, using the CUB and COCO datasets.
arXiv Detail & Related papers (2021-07-06T06:43:31Z)