Scene Graph Conditioning in Latent Diffusion
- URL: http://arxiv.org/abs/2310.10338v1
- Date: Mon, 16 Oct 2023 12:26:01 GMT
- Title: Scene Graph Conditioning in Latent Diffusion
- Authors: Frank Fundel
- Abstract summary: Diffusion models excel in image generation but lack detailed semantic control using text prompts.
In contrast, scene graphs offer a more precise representation of image content.
We show that using our proposed methods it is possible to generate images from scene graphs with much higher quality.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models excel in image generation but lack detailed semantic control
using text prompts. Additional techniques have been developed to address this
limitation. However, conditioning diffusion models solely on text-based
descriptions is challenging due to ambiguity and lack of structure. In
contrast, scene graphs offer a more precise representation of image content,
making them superior for fine-grained control and accurate synthesis in image
generation models. Paired image and scene-graph data are scarce, which makes
fine-tuning large diffusion models challenging. We propose multiple approaches
to tackle this problem using ControlNet and Gated Self-Attention. We show that
using our proposed methods it is possible to generate images from scene graphs
of much higher quality, outperforming previous methods. Our source code is
publicly available at
https://github.com/FrankFundel/SGCond
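The abstract names ControlNet and Gated Self-Attention as the conditioning mechanisms but gives no implementation details, so the following is only a minimal sketch of how gated self-attention (in the style of GLIGEN) could inject scene-graph tokens into a frozen latent-diffusion U-Net block. The class name, shapes, and zero-initialized gate are illustrative assumptions, not the paper's released code (see the repository above for that).

```python
import torch
import torch.nn as nn


class GatedSelfAttention(nn.Module):
    """Fuses scene-graph tokens into the visual token stream of a diffusion
    U-Net attention block. The residual update is scaled by a tanh gate that
    is zero-initialized, so training starts from the unmodified backbone."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, visual: torch.Tensor, graph: torch.Tensor) -> torch.Tensor:
        # visual: (B, N_vis, dim) flattened latent feature map
        # graph:  (B, N_obj, dim) embedded scene-graph object/relation tokens
        tokens = self.norm(torch.cat([visual, graph], dim=1))
        attended, _ = self.attn(tokens, tokens, tokens)
        # keep only the visual positions; gate the residual update
        return visual + torch.tanh(self.gate) * attended[:, : visual.shape[1]]


if __name__ == "__main__":
    block = GatedSelfAttention(dim=320)
    visual = torch.randn(2, 16 * 16, 320)  # 16x16 latent grid, flattened
    graph = torch.randn(2, 12, 320)        # stand-in tokens for (subject, predicate, object) triplets
    print(block(visual, graph).shape)      # torch.Size([2, 256, 320])
```

In such a setup the graph tokens would typically come from an encoder over the scene-graph triplets, and one gated block would be inserted per attention layer while the pretrained diffusion weights stay frozen.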
Related papers
- Diffusion Self-Distillation for Zero-Shot Customized Image Generation [40.11194010431839]
Diffusion Self-Distillation is a method in which a model generates its own dataset for text-conditioned image-to-image tasks.
We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images.
We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset.
arXiv Detail & Related papers (2024-11-27T18:58:52Z)
- Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation [44.457347230146404]
We leverage the scene graph, a powerful structured representation, for complex image generation.
We employ the generative capabilities of variational autoencoders and diffusion models in a generalizable manner.
Our method outperforms recent competitors based on text, layout, or scene graph.
arXiv Detail & Related papers (2024-10-01T07:02:46Z)
- Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS)
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z)
- Discriminative Class Tokens for Text-to-Image Diffusion Models [102.88033622546251]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images are: (i) more accurate and of higher quality than standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
- High-Fidelity Guided Image Synthesis with Latent Diffusion Models [50.39294302741698]
Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores.
arXiv Detail & Related papers (2022-11-30T15:43:20Z)
- Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training [112.94542676251133]
We propose to learn scene graph embeddings by directly optimizing their alignment with images.
Specifically, we pre-train an encoder to extract both global and local information from scene graphs.
The resulting method, called SGDiff, allows for the semantic manipulation of generated images by modifying scene graph nodes and connections.
arXiv Detail & Related papers (2022-11-21T01:11:19Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of synthesis (a minimal routing sketch follows this list).
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
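Since the eDiffi entry above only states that the experts are specialized for different stages of synthesis, here is a minimal sketch of how timestep-based routing keeps inference cost constant: only one expert runs per denoising step. The tiny MLP denoisers and the single boundary at t = 500 are illustrative assumptions, not eDiffi's actual architecture or noise schedule.

```python
import torch
import torch.nn as nn


class TinyDenoiser(nn.Module):
    """Stand-in for a text-conditioned epsilon-prediction U-Net."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x_t: torch.Tensor, t: int) -> torch.Tensor:
        # append a normalized timestep feature to every sample
        t_feat = torch.full_like(x_t[:, :1], float(t) / 1000.0)
        return self.net(torch.cat([x_t, t_feat], dim=-1))


class ExpertEnsemble(nn.Module):
    """Routes each denoising step to the expert owning that timestep interval,
    so per-step inference cost equals that of a single model."""

    def __init__(self, experts, boundaries):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.boundaries = sorted(boundaries)  # ascending timestep cut points

    def forward(self, x_t: torch.Tensor, t: int) -> torch.Tensor:
        idx = sum(t >= b for b in self.boundaries)  # pick the interval containing t
        return self.experts[idx](x_t, t)


if __name__ == "__main__":
    dim = 16
    # expert 0 handles low-noise steps (t < 500), expert 1 high-noise steps (t >= 500)
    ensemble = ExpertEnsemble([TinyDenoiser(dim), TinyDenoiser(dim)], boundaries=[500])
    x_t = torch.randn(4, dim)
    for t in (999, 400):  # a high-noise step, then a low-noise step
        print(t, ensemble(x_t, t).shape)
```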
This list is automatically generated from the titles and abstracts of the papers on this site.