SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis
- URL: http://arxiv.org/abs/2304.14573v1
- Date: Fri, 28 Apr 2023 00:14:28 GMT
- Title: SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis
- Authors: Azade Farshad, Yousef Yeganeh, Yu Chi, Chengzhi Shen, Björn Ommer, Nassir Navab
- Abstract summary: We propose a novel guidance approach for the sampling process in the diffusion model.
Our approach guides the model with semantic features from CLIP embeddings and enforces geometric constraints.
Our results demonstrate the effectiveness of incorporating bounding box and segmentation map guidance in the diffusion model sampling process.
- Score: 38.22195812238951
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-conditioned image generation has made significant progress in recent
years with generative adversarial networks and, more recently, diffusion models.
While diffusion models conditioned on text prompts have produced impressive and
high-quality images, accurately representing complex text prompts such as the
number of instances of a specific object remains challenging.
To address this limitation, we propose a novel guidance approach for the
sampling process in the diffusion model that leverages bounding box and
segmentation map information at inference time without additional training
data. Through a novel loss in the sampling process, our approach guides the
model with semantic features from CLIP embeddings and enforces geometric
constraints, leading to high-resolution images that accurately represent the
scene. To obtain bounding box and segmentation map information, we structure
the text prompt as a scene graph and enrich the nodes with CLIP embeddings. Our
proposed model achieves state-of-the-art performance on two public benchmarks
for image generation from scene graphs, surpassing both scene graph to image
and text-based diffusion models in various metrics. Our results demonstrate the
effectiveness of incorporating bounding box and segmentation map guidance in
the diffusion model sampling process for more accurate text-to-image
generation.
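The abstract describes gradient-based guidance applied during diffusion sampling: at each denoising step, a loss built from CLIP semantic features of the predicted regions and from geometric (bounding box and segmentation map) constraints steers the model's noise prediction. The sketch below is a minimal illustration of that general idea, not the authors' implementation; the functions `denoiser`, `clip_image_embed`, the box/segmentation inputs, and the exact form of the semantic and geometric loss terms are all illustrative placeholder assumptions.

import torch
import torch.nn.functional as F

def guided_sampling_step(x_t, t, denoiser, clip_image_embed,
                         boxes, node_embeds, seg_map,
                         alpha_bar_t, guidance_scale=1.0):
    """One noise-prediction step with region-level CLIP and geometry guidance.

    All arguments are placeholders: `denoiser(x_t, t)` predicts noise,
    `clip_image_embed(img)` returns CLIP image embeddings, and `boxes`,
    `node_embeds`, `seg_map` would come from an upstream scene-graph-to-layout step.
    """
    x_t = x_t.detach().requires_grad_(True)

    # Predict noise and form the usual estimate of the clean image x0.
    eps = denoiser(x_t, t)
    x0_hat = (x_t - (1.0 - alpha_bar_t) ** 0.5 * eps) / (alpha_bar_t ** 0.5)

    # Semantic term: each box crop should match its scene-graph node's CLIP embedding.
    sem_loss = 0.0
    for (x1, y1, x2, y2), node_embed in zip(boxes, node_embeds):
        crop = F.interpolate(x0_hat[..., y1:y2, x1:x2], size=(224, 224),
                             mode="bilinear", align_corners=False)
        sem_loss = sem_loss + (1.0 - F.cosine_similarity(
            clip_image_embed(crop), node_embed, dim=-1)).mean()

    # Geometric term: penalize content that falls outside the segmentation map.
    geo_loss = (x0_hat.abs() * (1.0 - seg_map)).mean()

    # Steer the noise prediction along the loss gradient (classifier-guidance style).
    grad = torch.autograd.grad(sem_loss + geo_loss, x_t)[0]
    return eps + guidance_scale * ((1.0 - alpha_bar_t) ** 0.5) * grad

Because a loss of this kind is applied only at inference time, such guidance requires no additional training data, consistent with the sampling-time approach described in the abstract.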
Related papers
- Generating Intermediate Representations for Compositional Text-To-Image Generation [16.757550214291015]
We propose a compositional approach for text-to-image generation based on two stages.
In the first stage, we design a diffusion-based generative model to produce one or more aligned intermediate representations conditioned on text.
In the second stage, we map these representations, together with the text, to the final output image using a separate diffusion-based generative model.
arXiv Detail & Related papers (2024-10-13T10:24:55Z)
- Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (inPainting vIa Latent OpTimization) is an optimization approach grounded on a novel semantic centralization and background preservation loss.
Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background.
arXiv Detail & Related papers (2024-07-10T19:58:04Z)
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation [74.5598315066249]
We probe into zero-shot grounded T2I generation with diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach.
arXiv Detail & Related papers (2023-10-13T05:48:42Z)
- Dense Text-to-Image Generation with Attention Modulation [49.287458275920514]
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions.
We propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions.
We achieve visual results of similar quality to models specifically trained with layout conditions.
arXiv Detail & Related papers (2023-08-24T17:59:01Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)