High-Fidelity Guided Image Synthesis with Latent Diffusion Models
- URL: http://arxiv.org/abs/2211.17084v1
- Date: Wed, 30 Nov 2022 15:43:20 GMT
- Title: High-Fidelity Guided Image Synthesis with Latent Diffusion Models
- Authors: Jaskirat Singh, Stephen Gould, Liang Zheng
- Abstract summary: Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores.
- Score: 50.39294302741698
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Controllable image synthesis with user scribbles has gained huge public
interest with the recent advent of text-conditioned latent diffusion models.
The user scribbles control the color composition while the text prompt provides
control over the overall image semantics. However, we note that prior works in
this direction suffer from an intrinsic domain shift problem, wherein the
generated outputs often lack details and resemble simplistic representations of
the target domain. In this paper, we propose a novel guided image synthesis
framework, which addresses this problem by modeling the output image as the
solution of a constrained optimization problem. We show that while computing an
exact solution to the optimization is infeasible, an approximation can be
achieved with just a single pass of the reverse diffusion process.
Additionally, we show that by simply defining a cross-attention-based
correspondence between the input text tokens and the user stroke-painting, the
user is also able to control the semantics of different painted regions without
requiring any conditional training or finetuning. Human user study results show
that the proposed approach outperforms the previous state-of-the-art by over
85.32% on the overall user satisfaction scores. Project page for our paper is
available at https://1jsingh.github.io/gradop.
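The core idea lends itself to a short sketch. Below is a minimal, illustrative single-pass guided sampler: at each reverse step the predicted clean latent is pulled toward the encoded user painting, which stands in for the constraint of the optimization problem. The toy noise predictor, schedule, and guidance weight are assumptions, not the authors' exact GradOP procedure.

```python
# Illustrative single-pass guided reverse diffusion (hedged sketch, not the
# paper's exact algorithm). `eps_model` is a placeholder for a trained
# text-conditioned U-Net noise predictor.
import torch

T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def eps_model(x_t, t):
    # Placeholder noise predictor; a real system would condition on the text prompt.
    return 0.1 * x_t

def guided_reverse_pass(painting_latent, guidance_weight=0.2):
    x_t = torch.randn_like(painting_latent)  # start from pure noise
    for t in reversed(range(T)):
        a_bar = alpha_bars[t]
        eps = eps_model(x_t, t)
        # Standard x0-estimate from the noise prediction.
        x0_hat = (x_t - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        # Guidance: pull the estimate toward the user painting; this plays
        # the role of the constraint in the optimization problem.
        x0_hat = (1 - guidance_weight) * x0_hat + guidance_weight * painting_latent
        # Deterministic DDIM (eta=0) step to t-1 with the corrected estimate.
        a_bar_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
        x_t = a_bar_prev.sqrt() * x0_hat + (1 - a_bar_prev).sqrt() * eps
    return x_t

painting = torch.randn(1, 4, 64, 64)  # stand-in for the encoded user strokes
sample = guided_reverse_pass(painting)
```

In the paper itself, region-level semantics are additionally steered by the cross-attention correspondence between text tokens and painted regions, which the sketch above does not model.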
Related papers
- Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS).
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z) - Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image
- Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models [46.18013380882767]
This work focuses on inverting the diffusion model to obtain interpretable language prompts directly.
We leverage the findings that different timesteps of the diffusion process cater to different levels of detail in an image.
We show that our approach can identify semantically interpretable and meaningful prompts for a target image.
arXiv Detail & Related papers (2023-12-19T18:47:30Z) - UDiffText: A Unified Framework for High-quality Text Synthesis in
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z) - MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z) - Uncovering the Disentanglement Capability in Text-to-Image Diffusion
- Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models [60.63556257324894]
A key desired property of image generative models is the ability to disentangle different attributes.
We propose a simple, light-weight image editing algorithm where the mixing weights of the two text embeddings are optimized for style matching and content preservation.
Experiments show that the proposed method can modify a wide range of attributes, outperforming existing diffusion-model-based image-editing algorithms.
arXiv Detail & Related papers (2022-12-16T19:58:52Z) - Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
- Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z) - eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z) - ImageBART: Bidirectional Context with Multinomial Diffusion for
- ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis [15.006676130258372]
Autoregressive models incorporate context in a linear 1D order by attending only to previously synthesized image patches above or to the left.
We propose a coarse-to-fine hierarchy of context by combining the autoregressive formulation with a multinomial diffusion process.
Our approach can take unrestricted, user-provided masks into account to perform local image editing.
arXiv Detail & Related papers (2021-08-19T17:50:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.