Freestyle Layout-to-Image Synthesis
- URL: http://arxiv.org/abs/2303.14412v1
- Date: Sat, 25 Mar 2023 09:37:41 GMT
- Title: Freestyle Layout-to-Image Synthesis
- Authors: Han Xue, Zhiwu Huang, Qianru Sun, Li Song, Wenjun Zhang
- Abstract summary: In this work, we explore the freestyle capability of the model, i.e., how far it can generate unseen semantics onto a given layout.
Inspired by this, we opt to leverage large-scale pre-trained text-to-image diffusion models to achieve the generation of unseen semantics.
The proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs.
- Score: 42.64485133926378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Typical layout-to-image synthesis (LIS) models generate images for a closed
set of semantic classes, e.g., 182 common objects in COCO-Stuff. In this work,
we explore the freestyle capability of the model, i.e., how far it can generate
unseen semantics (e.g., classes, attributes, and styles) onto a given layout,
and we call this task Freestyle LIS (FLIS). Thanks to the development of
large-scale pre-trained language-image models, a number of discriminative
models (e.g., for image classification and object detection) trained on limited
base classes are empowered to predict unseen classes.
Inspired by this, we opt to leverage large-scale pre-trained text-to-image
diffusion models to achieve the generation of unseen semantics. The key
challenge of FLIS is how to enable the diffusion model to synthesize images
from a specific layout which very likely violates its pre-learned knowledge,
e.g., the model never sees "a unicorn sitting on a bench" during its
pre-training. To this end, we introduce a new module called Rectified
Cross-Attention (RCA) that can be conveniently plugged into the diffusion model
to integrate semantic masks. This "plug-in" is applied in each cross-attention
layer of the model to rectify the attention maps between image and text tokens.
The key idea of RCA is to constrain each text token to act only on the pixels
in its specified region (a minimal sketch of this idea follows the abstract),
allowing us to freely put a wide variety of semantics from pre-trained
knowledge (which is general) onto the given layout (which is specific).
Extensive experiments show that the proposed diffusion network produces
realistic and freestyle layout-to-image generation results with diverse text
inputs, which has high potential to enable a range of interesting
applications. Code is available at https://github.com/essunny310/FreestyleNet.
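The rectification described above amounts to masking the image-to-text cross-attention so that each text token can only attend from (and thus act on) the pixels inside its assigned layout region. Below is a minimal PyTorch-style sketch of that idea; the function name, tensor shapes, and mask convention are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Sketch of the Rectified Cross-Attention (RCA) idea: mask cross-attention
# logits so each text token only acts on pixels of its layout region.
# Names, shapes, and the mask convention are assumptions for illustration.
import torch


def rectified_cross_attention(q, k, v, region_mask):
    """
    q:           (B, N_pix, d)      queries from image (latent) tokens
    k, v:        (B, N_txt, d)      keys/values from text tokens
    region_mask: (B, N_pix, N_txt)  1 where text token j may act on pixel i
                                    (pixel i lies in token j's region), else 0.
    Assumes every pixel is covered by at least one token (e.g., a background
    token), so each softmax row has at least one unmasked entry.
    """
    d = q.shape[-1]
    logits = torch.einsum("bid,bjd->bij", q, k) / d ** 0.5  # (B, N_pix, N_txt)
    # Rectify: forbid attention outside each token's region, then re-normalize.
    logits = logits.masked_fill(region_mask == 0, float("-inf"))
    attn = logits.softmax(dim=-1)
    return torch.einsum("bij,bjd->bid", attn, v)             # (B, N_pix, d)


if __name__ == "__main__":
    # Toy usage: 64 "pixels" split evenly among 4 text tokens.
    B, N_pix, N_txt, d = 1, 64, 4, 32
    q = torch.randn(B, N_pix, d)
    k, v = torch.randn(B, N_txt, d), torch.randn(B, N_txt, d)
    region_mask = torch.zeros(B, N_pix, N_txt)
    for j in range(N_txt):
        region_mask[:, j * 16:(j + 1) * 16, j] = 1
    out = rectified_cross_attention(q, k, v, region_mask)
    print(out.shape)  # torch.Size([1, 64, 32])
```

In the paper's setting, this masking would be applied inside every cross-attention layer of the pre-trained text-to-image diffusion model, with the per-token regions derived from the input semantic layout.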
Related papers
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage multiple relatively small, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach makes text-to-image diffusion models easier to use and provides a better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- Text-to-Image Diffusion Models are Zero-Shot Classifiers [8.26990105697146]
We investigate text-to-image diffusion models by proposing a method for evaluating them as zero-shot classifiers.
We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge.
They perform competitively with CLIP on a wide range of zero-shot image classification datasets.
arXiv Detail & Related papers (2023-03-27T14:15:17Z)
- Sketch-Guided Text-to-Image Diffusion Models [57.12095262189362]
We introduce a universal approach to guide a pretrained text-to-image diffusion model.
Our method does not require training a dedicated model or a specialized encoder for the task.
We take a particular focus on the sketch-to-image translation task, revealing a robust and expressive way to generate images.
arXiv Detail & Related papers (2022-11-24T18:45:32Z)
- Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors [40.959642112729234]
Peekaboo is a first-of-its-kind zero-shot, open-vocabulary, unsupervised semantic grounding technique.
We show how Peekaboo can be used to generate images with transparency, even though the underlying diffusion model was only trained on RGB images.
arXiv Detail & Related papers (2022-11-23T18:59:05Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information (including all content) and is not responsible for any consequences arising from its use.