Sketch-Guided Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2211.13752v1
- Date: Thu, 24 Nov 2022 18:45:32 GMT
- Title: Sketch-Guided Text-to-Image Diffusion Models
- Authors: Andrey Voynov, Kfir Aberman, Daniel Cohen-Or
- Abstract summary: We introduce a universal approach to guide a pretrained text-to-image diffusion model.
Our method does not require training a dedicated model or a specialized encoder for the task.
We take a particular focus on the sketch-to-image translation task, revealing a robust and expressive way to generate images.
- Score: 57.12095262189362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-Image models have introduced a remarkable leap in the evolution of
machine learning, demonstrating high-quality synthesis of images from a given
text-prompt. However, these powerful pretrained models still lack control
handles that can guide spatial properties of the synthesized images. In this
work, we introduce a universal approach to guide a pretrained text-to-image
diffusion model, with a spatial map from another domain (e.g., sketch) during
inference time. Unlike previous works, our method does not require training a
dedicated model or a specialized encoder for the task. Our key idea is to train
a Latent Guidance Predictor (LGP) - a small, per-pixel, Multi-Layer Perceptron
(MLP) that maps latent features of noisy images to spatial maps, where the deep
features are extracted from the core Denoising Diffusion Probabilistic Model
(DDPM) network. The LGP is trained only on a few thousand images and
constitutes a differentiable guiding map predictor, over which the loss is
computed and propagated back to push the intermediate images to agree with the
spatial map. The per-pixel training offers flexibility and locality, allowing
the technique to perform well on out-of-domain sketches, including
free-hand style drawings. We take a particular focus on the sketch-to-image
translation task, revealing a robust and expressive way to generate images that
follow the guidance of a sketch of arbitrary style or domain. Project page:
sketch-guided-diffusion.github.io
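
To make the guidance mechanism concrete, below is a minimal PyTorch-style sketch of a per-pixel predictor and a single guided sampling step. The class and helper names (LatentGuidancePredictor, extract_features, ddpm_update, guidance_scale), the layer sizes, and the feature layout are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentGuidancePredictor(nn.Module):
    """Small per-pixel MLP: maps a pixel's concatenated denoiser features
    to a scalar edge/sketch value (hypothetical layer sizes)."""

    def __init__(self, feature_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W) -> per-pixel prediction (B, 1, H, W)
        b, c, h, w = features.shape
        x = features.permute(0, 2, 3, 1).reshape(b * h * w, c)
        out = self.mlp(x)
        return out.reshape(b, h, w, 1).permute(0, 3, 1, 2)


def guided_step(x_t, t, target_sketch, denoiser, lgp, extract_features,
                ddpm_update, guidance_scale=1.0):
    """One sketch-guided DDPM sampling step (schematic).

    `denoiser`, `extract_features` (which resizes and concatenates the
    intermediate activations into a (B, C, H, W) tensor) and `ddpm_update`
    (the standard DDPM posterior step) are assumed to be supplied by the
    surrounding diffusion pipeline.
    """
    x_t = x_t.detach().requires_grad_(True)
    eps, activations = denoiser(x_t, t, return_features=True)
    feats = extract_features(activations)          # (B, C, H, W)
    pred_sketch = lgp(feats)                       # differentiable edge map
    loss = F.mse_loss(pred_sketch, target_sketch)  # agree with the guiding sketch
    grad = torch.autograd.grad(loss, x_t)[0]
    # Push the intermediate image toward the sketch, then take the DDPM step.
    x_t = x_t - guidance_scale * grad
    return ddpm_update(x_t.detach(), eps.detach(), t)
```

A common design choice with this kind of gradient guidance is to apply it only during the earlier, more structural denoising steps and let the final steps run unguided, trading off fidelity to the sketch against photorealism.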
Related papers
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS).
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment [47.53887941065894]
Pre-trained vision-language models have inspired much research on few-shot learning.
With only a few training images, the visual feature distributions are easily distracted by class-irrelevant information in images.
We propose a Selective Attack module that generates spatial attention maps of images to guide the attacks on class-irrelevant image areas.
arXiv Detail & Related papers (2023-05-19T05:45:17Z)
- Freestyle Layout-to-Image Synthesis [42.64485133926378]
In this work, we explore the freestyle capability of the model, i.e., how far it can generate unseen semantics onto a given layout.
To this end, we leverage large-scale pre-trained text-to-image diffusion models to generate unseen semantics.
The proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs.
arXiv Detail & Related papers (2023-03-25T09:37:41Z)
- HandNeRF: Neural Radiance Fields for Animatable Interacting Hands [122.32855646927013]
We propose a novel framework to reconstruct accurate appearance and geometry with neural radiance fields (NeRF) for interacting hands.
We conduct extensive experiments to verify the merits of our proposed HandNeRF and report a series of state-of-the-art results.
arXiv Detail & Related papers (2023-03-24T06:19:19Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)