Discriminative Class Tokens for Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2303.17155v3
- Date: Sun, 10 Sep 2023 17:33:30 GMT
- Title: Discriminative Class Tokens for Text-to-Image Diffusion Models
- Authors: Idan Schwartz, Vésteinn Snæbjarnarson, Hila Chefer, Ryan Cotterell, Serge Belongie, Lior Wolf, Sagie Benaim
- Abstract summary: We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
- Score: 107.98436819341592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in text-to-image diffusion models have enabled the generation
of diverse and high-quality images. While impressive, the images often fall
short of depicting subtle details and are susceptible to errors due to
ambiguity in the input text. One way of alleviating these issues is to train
diffusion models on class-labeled datasets. This approach has two
disadvantages: (i) supervised datasets are generally small compared to
large-scale scraped text-image datasets on which text-to-image models are
trained, affecting the quality and diversity of the generated images, and (ii)
the input is a hard-coded label, as opposed to free-form text, limiting
control over the generated images.
In this work, we propose a non-invasive fine-tuning technique that
capitalizes on the expressive potential of free-form text while achieving high
accuracy through discriminative signals from a pretrained classifier. This is
done by iteratively modifying the embedding of an added input token of a
text-to-image diffusion model, steering the generated images toward a given
target class according to a classifier. Our method is fast compared to prior
fine-tuning methods and does not require a collection of in-class images or
retraining of a noise-tolerant classifier. We evaluate our method extensively,
showing that the generated images (i) are more accurate and of higher quality
than those of standard diffusion models, (ii) can be used to augment training data
in a low-resource setting, and (iii) reveal information about the data used to train
the guiding classifier. The code is available at
https://github.com/idansc/discriminative_class_tokens.
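To make the optimization concrete, the following is a minimal, self-contained PyTorch sketch of the idea described in the abstract; it is not the authors' released implementation (see the repository linked above). The tiny generator and classifier modules are hypothetical stand-ins for the frozen text-to-image diffusion model and the pretrained guiding classifier, and only the embedding of the added class token receives gradient updates.

```python
# Toy sketch of discriminative class-token optimization (not the authors' code).
# A frozen "generator" stands in for the text-to-image diffusion model and a
# frozen classifier supplies the discriminative signal; only the added token's
# embedding is trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
EMB_DIM, IMG_DIM, NUM_CLASSES = 32, 64, 10

# Frozen stand-ins (hypothetical) for the diffusion model and guiding classifier.
generator = nn.Sequential(nn.Linear(2 * EMB_DIM, 128), nn.ReLU(), nn.Linear(128, IMG_DIM))
classifier = nn.Sequential(nn.Linear(IMG_DIM, 128), nn.ReLU(), nn.Linear(128, NUM_CLASSES))
for p in list(generator.parameters()) + list(classifier.parameters()):
    p.requires_grad_(False)

prompt_embedding = torch.randn(EMB_DIM)                  # frozen embedding of the free-form prompt
class_token = torch.randn(EMB_DIM, requires_grad=True)   # the added token: the only trainable tensor
target_class = torch.tensor([3])                         # hypothetical target class index

optimizer = torch.optim.Adam([class_token], lr=1e-2)
for step in range(200):
    # "Generate" an image conditioned on the prompt plus the added token.
    image = generator(torch.cat([prompt_embedding, class_token]).unsqueeze(0))
    # Discriminative signal: push the generation toward the target class.
    loss = F.cross_entropy(classifier(image), target_class)
    optimizer.zero_grad()
    loss.backward()   # gradients reach only class_token; everything else is frozen
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step:3d}  classifier loss {loss.item():.4f}")
```

In the real setting the image would come from the diffusion model's sampling path and the classifier would be a standard pretrained network; because only the single token embedding changes, the procedure is non-invasive and needs neither a collection of in-class images nor a retrained noise-tolerant classifier.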
Related papers
- UDiffText: A Unified Framework for High-quality Text Synthesis in
Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference-stage refinement process, we achieve notably high sequence accuracy when synthesizing text in arbitrary images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- Reverse Stable Diffusion: What prompt was used to generate this image? [73.10116197883303]
We study the task of predicting the prompt embedding given an image generated by a generative diffusion model.
We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective.
We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion.
arXiv Detail & Related papers (2023-08-02T23:39:29Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z)
- Text-to-Image Diffusion Models are Zero-Shot Classifiers [8.26990105697146]
We investigate text-to-image diffusion models by proposing a method for evaluating them as zero-shot classifiers.
We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge.
They perform competitively with CLIP on a wide range of zero-shot image classification datasets.
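A minimal sketch of this zero-shot evaluation recipe is included after this list.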
arXiv Detail & Related papers (2023-03-27T14:15:17Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics drawn from both the input texts and the input images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Cap2Aug: Caption guided Image to Image data Augmentation [41.53127698828463]
Cap2Aug is an image-to-image diffusion model-based data augmentation strategy using image captions as text prompts.
We generate captions from the limited training images and use these captions to edit the training images with an image-to-image Stable Diffusion model.
This strategy generates augmented versions of images similar to the training images yet provides semantic diversity across the samples.
arXiv Detail & Related papers (2022-12-11T04:37:43Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
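As a companion to the "Text-to-Image Diffusion Models are Zero-Shot Classifiers" entry above, here is a hedged sketch of one common recipe for that kind of evaluation: score each candidate class prompt by the average denoising error a frozen Stable Diffusion model incurs on the image's latent, and predict the class with the lowest error. It illustrates the general idea with the diffusers library and is not the cited paper's exact procedure; the checkpoint id, prompt template, and number of noise samples are assumptions.

```python
# Sketch: a frozen Stable Diffusion model as a zero-shot classifier.
# For each candidate class prompt, average the noise-prediction error over a
# few random (timestep, noise) draws; the best-fitting class has the lowest error.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumed checkpoint id; any epsilon-prediction Stable Diffusion model works the same way.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

@torch.no_grad()
def class_scores(image, class_names, n_samples=8):
    """image: [1, 3, H, W] tensor scaled to [-1, 1] (e.g. 512x512). Lower score = better fit."""
    latents = pipe.vae.encode(image.to(device)).latent_dist.mean
    latents = latents * pipe.vae.config.scaling_factor
    scores = {}
    for name in class_names:
        tokens = pipe.tokenizer(
            f"a photo of a {name}",          # assumed prompt template
            padding="max_length",
            max_length=pipe.tokenizer.model_max_length,
            truncation=True,
            return_tensors="pt",
        ).input_ids.to(device)
        text_emb = pipe.text_encoder(tokens)[0]
        errs = []
        for _ in range(n_samples):
            t = torch.randint(0, pipe.scheduler.config.num_train_timesteps, (1,), device=device)
            noise = torch.randn_like(latents)
            noisy = pipe.scheduler.add_noise(latents, noise, t)
            pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
            errs.append(F.mse_loss(pred, noise).item())
        scores[name] = sum(errs) / len(errs)
    return scores

# Usage:
#   scores = class_scores(img, ["cat", "dog", "car"])
#   predicted = min(scores, key=scores.get)   # class with the lowest denoising error
```

Accuracy generally improves with more (timestep, noise) samples per class, at a proportional increase in compute.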
This list is automatically generated from the titles and abstracts of the papers listed on this site.