Counting Guidance for High Fidelity Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2306.17567v1
- Date: Fri, 30 Jun 2023 11:40:35 GMT
- Title: Counting Guidance for High Fidelity Text-to-Image Synthesis
- Authors: Wonjun Kang, Kevin Galim, Hyung Il Koo
- Abstract summary: Text-to-image diffusion models fail to generate high fidelity content with respect to the input prompt.
E.g. given a prompt "five apples and ten lemons on a table", diffusion-generated images usually contain the wrong number of objects.
We propose a method to improve diffusion models to focus on producing the correct object count.
- Score: 2.6212127510234797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, the quality and performance of text-to-image generation
have significantly advanced due to the impressive results of diffusion models.
However, text-to-image diffusion models still fail to generate high fidelity
content with respect to the input prompt. One problem where text-to-image
diffusion models struggle is generating the exact number of objects specified in the text
prompt. E.g. given a prompt "five apples and ten lemons on a table",
diffusion-generated images usually contain the wrong number of objects. In this
paper, we propose a method to improve diffusion models to focus on producing
the correct object count given the input prompt. We adopt a counting network
that performs reference-less class-agnostic counting for any given image. We
calculate the gradients of the counting network and refine the predicted noise
for each step. To handle multiple types of objects in the prompt, we use novel
attention map guidance to obtain high-fidelity masks for each object. Finally,
we guide the denoising process by the calculated gradients for each object.
Through extensive experiments and evaluation, we demonstrate that our proposed
guidance method greatly improves the fidelity of diffusion models to object
count.
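The guidance step described in the abstract (take the gradient of a counting objective and use it to refine the predicted noise at each denoising step) can be sketched in the style of classifier guidance. This is a minimal illustration under stated assumptions, not the authors' implementation: `soft_count` is a hypothetical toy stand-in for the counting network, the squared-error counting loss and the `scale` parameter are assumptions, the predicted noise is held fixed when differentiating, and the per-object attention masks are omitted.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def predict_x0(x_t, eps, alpha_bar):
    """Standard DDPM estimate of the clean image from x_t and the predicted noise."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(x_t, eps)]


def soft_count(x0):
    """Hypothetical toy counter: each pixel contributes sigmoid(value) to the count."""
    return sum(sigmoid(v) for v in x0)


def counting_loss_grad(x_t, eps, alpha_bar, target):
    """Gradient of (soft_count(x0_hat) - target)^2 w.r.t. x_t.

    Assumption: eps is treated as a constant, so d(x0_hat)/d(x_t) = 1/sqrt(alpha_bar).
    """
    x0 = predict_x0(x_t, eps, alpha_bar)
    err = soft_count(x0) - target
    a = math.sqrt(alpha_bar)
    return [2.0 * err * sigmoid(v) * (1.0 - sigmoid(v)) / a for v in x0]


def guide_noise(x_t, eps, alpha_bar, target, scale=1.0):
    """Refine the predicted noise: eps + scale * sqrt(1 - alpha_bar) * grad(loss)."""
    grad = counting_loss_grad(x_t, eps, alpha_bar, target)
    b = math.sqrt(1.0 - alpha_bar)
    return [e + scale * b * g for e, g in zip(eps, grad)]


if __name__ == "__main__":
    x_t = [0.0, 0.0, 0.0, 0.0]   # toy 4-pixel "latent"
    eps = [0.0, 0.0, 0.0, 0.0]   # toy predicted noise
    alpha_bar, target = 0.5, 3.0

    before = soft_count(predict_x0(x_t, eps, alpha_bar))
    guided = guide_noise(x_t, eps, alpha_bar, target)
    after = soft_count(predict_x0(x_t, guided, alpha_bar))
    print("count before guidance:", before)
    print("count after guidance: ", after)
```

With the noise held fixed, the refinement is equivalent to a gradient-descent step on the counting loss in the space of the clean-image estimate, so the estimated count moves toward the target when the step is small enough. A real pipeline would replace `soft_count` with a trained counting network and obtain the gradient by automatic differentiation.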
Related papers
- Iterative Object Count Optimization for Text-to-image Diffusion Models [59.03672816121209]
Current models, which learn from image-text pairs, inherently struggle with counting.
We propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential.
We evaluate the generation of various objects and show significant improvements in accuracy.
arXiv Detail & Related papers (2024-08-21T15:51:46Z)
- Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experiment results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z)
- Semantic Generative Augmentations for Few-Shot Counting [0.0]
We investigate how synthetic data can benefit few-shot class-agnostic counting.
We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map.
Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent, high-performing few-shot counting models.
arXiv Detail & Related papers (2023-10-26T11:42:48Z)
- Aligning Text-to-Image Diffusion Models with Reward Backpropagation [62.45086888512723]
We propose AlignProp, a method that aligns diffusion models to downstream reward functions using end-to-end backpropagation of the reward gradient.
We show AlignProp achieves higher rewards in fewer training steps than alternatives, while being conceptually simpler.
arXiv Detail & Related papers (2023-10-05T17:59:18Z)
- SYRAC: Synthesize, Rank, and Count [19.20599654208014]
We propose a novel approach to eliminate the annotation burden by leveraging latent diffusion models to generate synthetic data.
We report state-of-the-art results for unsupervised crowd counting.
arXiv Detail & Related papers (2023-10-02T21:52:47Z)
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
- Reverse Stable Diffusion: What prompt was used to generate this image? [73.10116197883303]
We study the task of predicting the prompt embedding given an image generated by a generative diffusion model.
We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective.
We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion.
arXiv Detail & Related papers (2023-08-02T23:39:29Z)
- Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different synthesis stages.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)