Unlocking Spatial Comprehension in Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2311.17937v1
- Date: Tue, 28 Nov 2023 19:00:02 GMT
- Title: Unlocking Spatial Comprehension in Text-to-Image Diffusion Models
- Authors: Mohammad Mahdi Derakhshani, Menglin Xia, Harkirat Behl, Cees G. M. Snoek, Victor Rühle
- Abstract summary: CompFuser is an image generation pipeline that enhances spatial comprehension and attribute assignment in text-to-image generative models.
Our pipeline enables the interpretation of instructions defining spatial relationships between objects in a scene.
- Score: 33.99474729408903
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose CompFuser, an image generation pipeline that enhances spatial comprehension and attribute assignment in text-to-image generative models. Our pipeline enables the interpretation of instructions that define spatial relationships between objects in a scene, such as 'An image of a gray cat on the left of an orange dog', and generates the corresponding images. This is especially important for giving the user more control over the composition of the generated scene. CompFuser overcomes a limitation of existing text-to-image diffusion models by decomposing the generation of multiple objects into iterative steps: it first generates a single object and then edits the image by placing additional objects in their designated positions. To create training data for spatial comprehension and attribute assignment, we introduce a synthetic data generation process that leverages a frozen large language model and a frozen layout-based diffusion model for object placement. We compare our approach to strong baselines and show that our model outperforms state-of-the-art image generation models in spatial comprehension and attribute assignment, despite having 3x to 5x fewer parameters.
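As a rough illustration of the generate-then-edit idea described in the abstract, the sketch below chains two off-the-shelf Hugging Face diffusers pipelines: Stable Diffusion generates the first object, and InstructPix2Pix then edits the image to place the second object. The model choices, prompts, and sampling parameters are assumptions for illustration only; this is not the CompFuser implementation.

```python
# Minimal sketch of the iterative "generate one object, then edit to add the
# next" idea. Stable Diffusion and InstructPix2Pix are stand-ins; CompFuser's
# own learned editing model is not publicly packaged, so this is illustrative.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionInstructPix2PixPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Step 1: generate an image containing only the first object.
txt2img = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
).to(device)
base = txt2img("A photo of a gray cat, full body, plain background").images[0]

# Step 2: edit the image so the second object appears at its designated position.
editor = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=dtype
).to(device)
result = editor(
    "Add an orange dog to the right of the cat",
    image=base,
    num_inference_steps=50,
    image_guidance_scale=1.5,
).images[0]
result.save("gray_cat_left_of_orange_dog.png")
```

In CompFuser the editing step is trained on the synthetic data described above; the instruction-tuned editor here merely stands in for that component.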
Related papers
- SINGAPO: Single Image Controlled Generation of Articulated Parts in Objects [20.978091381109294]
We propose a method to generate articulated objects from a single image.
Our method generates an articulated object that is visually consistent with the input image.
Our experiments show that our method outperforms the state-of-the-art in articulated object creation.
arXiv Detail & Related papers (2024-10-21T20:41:32Z)
- Paint by Inpaint: Learning to Add Image Objects by Removing Them First [8.399234415641319]
We train a diffusion model to invert the inpainting process, effectively adding objects into images.
We provide detailed descriptions of the removed objects and use a Large Language Model to convert these descriptions into diverse, natural-language instructions.
arXiv Detail & Related papers (2024-04-28T15:07:53Z)
- Salient Object-Aware Background Generation using Text-Guided Diffusion Models [4.747826159446815]
We present a model for adapting inpainting diffusion models to the salient object outpainting task using Stable Diffusion and ControlNet architectures.
Our proposed approach reduces object expansion by 3.6x on average with no degradation in standard visual metrics across multiple datasets.
arXiv Detail & Related papers (2024-04-15T22:13:35Z)
- Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN).
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z)
- LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z)
- ObjectComposer: Consistent Generation of Multiple Objects Without Fine-tuning [25.033615513933192]
We introduce ObjectComposer for generating compositions of multiple objects that resemble user-specified images.
Our approach is training-free, leveraging the abilities of preexisting models.
arXiv Detail & Related papers (2023-10-10T19:46:58Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis [59.10787643285506]
Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks.
One critical limitation of these models is the low fidelity of generated images with respect to the text description.
We propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models.
arXiv Detail & Related papers (2023-04-07T23:49:34Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to improved layout fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.