Zero-shot spatial layout conditioning for text-to-image diffusion models
- URL: http://arxiv.org/abs/2306.13754v1
- Date: Fri, 23 Jun 2023 19:24:48 GMT
- Title: Zero-shot spatial layout conditioning for text-to-image diffusion models
- Authors: Guillaume Couairon, Marlène Careil, Matthieu Cord, Stéphane Lathuilière, Jakob Verbeek
- Abstract summary: Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
- Score: 52.24744018240424
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large-scale text-to-image diffusion models have significantly improved the
state of the art in generative image modelling and allow for an intuitive and
powerful user interface to drive the image generation process. Expressing
spatial constraints, e.g. to position specific objects in particular locations,
is cumbersome using text, and current text-based image generation models are
not able to accurately follow such instructions. In this paper we consider
image generation from text associated with segments on the image canvas, which
combines an intuitive natural language interface with precise spatial control
over the generated content. We propose ZestGuide, a zero-shot segmentation
guidance approach that can be plugged into pre-trained text-to-image diffusion
models, and does not require any additional training. It leverages implicit
segmentation maps that can be extracted from cross-attention layers, and uses
them to align the generation with input masks. Our experimental results combine
high image quality with accurate alignment of generated content with input
segmentations, and improve over prior work both quantitatively and
qualitatively, including methods that require training on images with
corresponding segmentations. Compared to Paint with Words, the previous
state of the art in image generation with zero-shot segmentation conditioning,
we improve by 5 to 10 mIoU points on the COCO dataset with similar FID scores.
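The guidance mechanism described in the abstract lends itself to a compact sketch. Below is a minimal PyTorch illustration of aligning cross-attention maps with input masks, assuming a simple "attention mass inside the mask" loss; the paper's actual loss and update rule may differ, and `zest_style_loss` is an illustrative name, not the authors' code.

```python
import torch

def zest_style_loss(attn: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """attn: (S, H, W) cross-attention maps, one per text segment, averaged
    over that segment's tokens; masks: (S, H, W) binary input layouts."""
    a = attn.flatten(1)
    m = masks.flatten(1)
    # Fraction of each segment's attention mass that falls inside its mask;
    # the loss pushes that fraction toward 1.
    inside = (a * m).sum(-1) / (a.sum(-1) + 1e-8)
    return (1.0 - inside).mean()

# Toy check with random tensors. In an actual sampler, `attn` would be read
# out of the UNet's cross-attention layers at each denoising step, and the
# gradient of this loss w.r.t. the noisy latent would be used to update it.
attn = torch.rand(2, 16, 16, requires_grad=True)
masks = (torch.rand(2, 16, 16) > 0.5).float()
loss = zest_style_loss(attn, masks)
loss.backward()
print(float(loss), attn.grad.shape)
```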
Related papers
- InvSeg: Test-Time Prompt Inversion for Semantic Segmentation [33.60580908728705]
InvSeg is a test-time prompt inversion method for semantic segmentation.
We introduce Contrastive Soft Clustering to align masks with the image's structural information.
InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities.
arXiv Detail & Related papers (2024-10-15T10:20:31Z)
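A hedged sketch of what a contrastive soft-clustering objective could look like, assuming soft masks come from cross-attention over learned prompt embeddings; this is a loose reading of the summary above, not InvSeg's actual loss.

```python
import torch
import torch.nn.functional as F

def soft_cluster_loss(maps: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
    """maps: (C, H, W) soft masks induced by C learned prompt embeddings;
    feats: (D, H, W) frozen image features. Pixels assigned to the same
    mask are pulled toward a shared feature centroid."""
    w = maps.flatten(1).softmax(dim=0)         # (C, HW) soft assignments
    f = F.normalize(feats.flatten(1), dim=0)   # (D, HW) unit-norm features
    centroids = F.normalize(f @ w.t(), dim=0)  # (D, C) per-cluster centers
    sim = centroids.t() @ f                    # (C, HW) pixel-centroid sims
    return -(w * sim).sum(0).mean()            # agree with assigned centroid

# Test-time loop (sketch): optimize the prompt embeddings that produce
# `maps` until this loss converges, then argmax over C for the segmentation.
loss = soft_cluster_loss(torch.rand(3, 32, 32), torch.rand(64, 32, 32))
print(float(loss))
```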
- Improving Compositional Text-to-image Generation with Large Vision-Language Models [26.202725136839632]
Compositional text-to-image models frequently encounter difficulties in generating high-quality images that align with input texts.
We employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts.
Our experimental results validate that the proposed methodology significantly improves text-image alignment in compositional image generation.
arXiv Detail & Related papers (2023-10-10T05:09:05Z)
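A minimal sketch of multi-dimensional LVLM assessment, where `ask_lvlm` and the three dimensions are hypothetical stand-ins for whatever vision-language model and assessment axes the paper actually uses.

```python
from typing import Callable

# Dimensions are illustrative; the paper defines its own assessment axes.
DIMENSIONS = ("objects present", "attributes correct", "relations correct")

def alignment_scores(image_path: str, prompt: str,
                     ask_lvlm: Callable[[str, str], float]) -> dict:
    """Query the LVLM once per dimension; `ask_lvlm` is a hypothetical
    wrapper around whatever vision-language model API is available."""
    return {d: ask_lvlm(image_path, f"Score 0-1: {d} for prompt '{prompt}'")
            for d in DIMENSIONS}

# Stubbed usage; a real loop would regenerate or reweight based on scores.
print(alignment_scores("img.png", "a red cube left of a blue ball",
                       ask_lvlm=lambda img, q: 0.5))
```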
- Dense Text-to-Image Generation with Attention Modulation [49.287458275920514]
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions.
We propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions.
We achieve visual results of similar quality to those of models specifically trained with layout conditions.
arXiv Detail & Related papers (2023-08-24T17:59:01Z)
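A rough sketch of training-free attention modulation under a layout, assuming an additive bias on pre-softmax cross-attention scores; the shapes and the bias form are assumptions, not DenseDiffusion's exact rule.

```python
import torch

def modulate_attention(logits: torch.Tensor, masks: torch.Tensor,
                       token_to_seg: list, w: float = 1.0) -> torch.Tensor:
    """logits: (HW, T) pre-softmax cross-attention scores; masks: (S, HW)
    binary region per caption segment; token_to_seg[t]: segment index of
    token t, or -1 if unassigned. Boost scores inside a token's region,
    suppress them outside, then renormalize."""
    out = logits.clone()
    for t, s in enumerate(token_to_seg):
        if s >= 0:
            out[:, t] += w * (2.0 * masks[s] - 1.0)  # +w inside, -w outside
    return out.softmax(dim=-1)

# Toy usage: 4x4 canvas (HW=16), 3 tokens, 2 regions.
attn = modulate_attention(torch.randn(16, 3),
                          (torch.rand(2, 16) > 0.5).float(),
                          token_to_seg=[0, 1, -1])
print(attn.shape)  # (16, 3), rows sum to 1
```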
- Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
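A hedged sketch of learning a single discriminative token from a frozen classifier's gradients; `generate` and `classify` are hypothetical differentiable callables, and the loop is illustrative rather than the paper's procedure.

```python
import torch
import torch.nn.functional as F

def learn_class_token(generate, classify, target: int,
                      dim: int = 8, steps: int = 20) -> torch.Tensor:
    """Optimize one extra token embedding so that generated images are
    recognized as `target` by a frozen classifier."""
    token = torch.zeros(dim, requires_grad=True)
    opt = torch.optim.Adam([token], lr=0.1)
    for _ in range(steps):
        logits = classify(generate(token))            # (1, num_classes)
        loss = F.cross_entropy(logits, torch.tensor([target]))
        opt.zero_grad(); loss.backward(); opt.step()
    return token.detach()

# Stub demo: the "image" is the token itself and the classifier is identity.
tok = learn_class_token(generate=lambda t: t.view(1, -1),
                        classify=lambda x: x, target=3)
print(tok.argmax())  # drifts toward the target index
```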
- Variational Distribution Learning for Unsupervised Text-to-Image Generation [42.3246826401366]
We propose a text-to-image generation algorithm based on deep neural networks when text captions for images are unavailable during training.
We employ a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space.
We optimize a text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings.
arXiv Detail & Related papers (2023-03-28T16:18:56Z)
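A simplified sketch of caption-free training against CLIP embeddings; it glosses over the variational machinery the paper's title refers to, and `generator` returning a log-likelihood is a hypothetical interface.

```python
import torch

def vdl_style_step(generator, image: torch.Tensor,
                   clip_image_embed: torch.Tensor) -> torch.Tensor:
    """One training step without captions: condition on the CLIP image
    embedding (a stand-in for a paired caption) and minimize the negative
    log-likelihood of the image."""
    return -generator(image, clip_image_embed).mean()

# Stub demo. At inference time, a CLIP *text* embedding of the user's
# prompt is substituted for the image embedding, since CLIP aligns both
# modalities in one joint space.
loss = vdl_style_step(lambda img, c: -((img.mean() - c.mean()) ** 2),
                      torch.rand(3, 64, 64), torch.rand(512))
print(float(loss))
```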
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics drawn from both the input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
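A toy sketch of one unified multi-modal conditioning sequence, assuming image features are simply projected into the text embedding space and concatenated; the actual UMM-Diffusion architecture is more involved, and the dimensions and names here are assumptions.

```python
import torch
import torch.nn as nn

class UnifiedConditioner(nn.Module):
    """Project subject-image features into the text embedding space and
    concatenate both modalities into one sequence that a diffusion UNet
    can cross-attend to."""
    def __init__(self, img_dim: int = 512, txt_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(img_dim, txt_dim)

    def forward(self, text_emb: torch.Tensor, img_feat: torch.Tensor):
        # text_emb: (B, T, txt_dim); img_feat: (B, N, img_dim)
        return torch.cat([text_emb, self.proj(img_feat)], dim=1)

cond = UnifiedConditioner()(torch.randn(1, 77, 768), torch.randn(1, 4, 512))
print(cond.shape)  # torch.Size([1, 81, 768])
```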
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
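A minimal sketch of routing denoising steps to stage-specialized experts; the number of experts and the timestep boundaries are illustrative assumptions, not eDiffi's actual schedule.

```python
import torch

def ensemble_denoise(latent: torch.Tensor, t: int, experts: dict,
                     boundaries=(250, 750)) -> torch.Tensor:
    """Route one denoising step to the expert specialized for the
    current noise level."""
    if t >= boundaries[1]:
        expert = experts["high_noise"]  # early steps: global composition
    elif t >= boundaries[0]:
        expert = experts["mid_noise"]
    else:
        expert = experts["low_noise"]   # late steps: fine visual detail
    return expert(latent, t)

# Identity stubs in place of trained denoisers:
experts = {k: (lambda x, t: x) for k in ("high_noise", "mid_noise", "low_noise")}
out = ensemble_denoise(torch.randn(1, 4, 64, 64), t=900, experts=experts)
print(out.shape)
```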
- Text-to-Image Generation Grounded by Fine-Grained User Attention [62.94737811887098]
Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces.
We propose TReCS, a sequential model that exploits this grounding to generate images.
arXiv Detail & Related papers (2020-11-07T13:23:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and accepts no responsibility for any consequences of its use.