Generating Compositional Scenes via Text-to-image RGBA Instance Generation
- URL: http://arxiv.org/abs/2411.10913v1
- Date: Sat, 16 Nov 2024 23:44:14 GMT
- Title: Generating Compositional Scenes via Text-to-image RGBA Instance Generation
- Authors: Alessandro Fontanella, Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Sarah Parisot
- Abstract summary: Text-to-image diffusion generative models can generate high quality images at the cost of tedious prompt engineering.
We propose a novel multi-stage generation paradigm that is designed for fine-grained control, flexibility and interactivity.
Our experiments show that our RGBA diffusion model is capable of generating diverse and high quality instances with precise control over object attributes.
- Score: 82.63805151691024
- License:
- Abstract: Text-to-image diffusion generative models can generate high-quality images at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning; however, existing methods lack layout editing ability and fine-grained control over object attributes. The concept of multi-layer generation holds great potential to address these limitations; however, generating image instances concurrently with scene composition limits control over fine-grained object attributes, relative positioning in 3D space, and scene manipulation abilities. In this work, we propose a novel multi-stage generation paradigm that is designed for fine-grained control, flexibility, and interactivity. To ensure control over instance attributes, we devise a novel training paradigm to adapt a diffusion model to generate isolated scene components as RGBA images with transparency information. To build complex images, we employ these pre-generated instances and introduce a multi-layer composite generation process that smoothly assembles components into realistic scenes. Our experiments show that our RGBA diffusion model is capable of generating diverse and high-quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach makes it possible to build and manipulate images from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods.
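The abstract outlines a two-stage pipeline: a diffusion model is first adapted to generate isolated scene components as RGBA images with transparency, and the pre-generated instances are then assembled into a scene through a multi-layer composite generation process. The sketch below only illustrates the layering idea and is not the paper's method: `generate_rgba_instance` is a hypothetical placeholder for the adapted RGBA diffusion model, and composition is shown as plain back-to-front alpha compositing with Pillow, whereas the paper's composite stage is itself diffusion-based.

```python
# Illustrative sketch of the two-stage idea: (1) generate isolated instances as
# RGBA images, (2) assemble them onto a background layer by layer.
# NOTE: `generate_rgba_instance` is a hypothetical stand-in for the paper's
# adapted RGBA diffusion model, and plain alpha compositing here replaces the
# paper's diffusion-based multi-layer composite generation.
from PIL import Image


def generate_rgba_instance(prompt: str, size=(512, 512)) -> Image.Image:
    """Hypothetical placeholder: return an RGBA image whose alpha channel
    isolates the generated instance from a transparent background."""
    raise NotImplementedError("Replace with the actual RGBA instance generator.")


def compose_scene(background: Image.Image, layers) -> Image.Image:
    """Assemble pre-generated RGBA instances onto a background, back to front.

    `layers` is a list of (rgba_instance, (x, y)) pairs; later entries occlude
    earlier ones, which is how relative depth ordering is expressed here.
    """
    scene = background.convert("RGBA")
    for instance, position in layers:
        overlay = Image.new("RGBA", scene.size, (0, 0, 0, 0))
        overlay.paste(instance, position, mask=instance)  # alpha channel as paste mask
        scene = Image.alpha_composite(scene, overlay)      # per-pixel alpha blending
    return scene


# Illustrative usage (prompts, file names, and positions are made up):
# lamp = generate_rgba_instance("a green desk lamp")
# cat = generate_rgba_instance("a ginger cat sitting, studio lighting")
# scene = compose_scene(Image.open("background.png"),
#                       [(lamp, (40, 200)), (cat, (300, 180))])
# scene.save("composed_scene.png")
```

Ordering the layers back to front is what gives per-instance control over occlusion and placement; in the paper this step is handled by the composite generation process, which smoothly assembles the components into realistic scenes rather than hard-pasting them.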
Related papers
- ComposeAnyone: Controllable Layout-to-Human Generation with Decoupled Multimodal Conditions [74.30040551058319]
ComposeAnyone is a controllable layout-to-human generation method with decoupled multimodal conditions.
Our dataset provides decoupled text and reference image annotations for different components of each human image.
Experiments on multiple datasets demonstrate that ComposeAnyone generates human images with better alignment to given layouts.
arXiv Detail & Related papers (2025-01-21T14:32:47Z)
- UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation [64.8341372591993]
We propose a new approach to unify controllable generation within a single framework.
Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture.
Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions.
arXiv Detail & Related papers (2024-12-25T15:19:02Z)
- OminiControl: Minimal and Universal Control for Diffusion Transformer [68.3243031301164]
OminiControl is a framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models.
At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone.
OminiControl addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions.
arXiv Detail & Related papers (2024-11-22T17:55:15Z)
- OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction [32.08995899903304]
We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization.
Our approach significantly expands the scope of text-to-image generation and makes it more versatile and practical in terms of controllability.
arXiv Detail & Related papers (2024-10-07T11:26:13Z)
- MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation [54.64194935409982]
We introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer-wise RGBA decompositions.
MuLAn is the first photorealistic resource providing instance decomposition and spatial information for high quality images.
We aim to encourage the development of novel generation and editing technology, in particular layer-wise solutions.
arXiv Detail & Related papers (2024-04-03T14:58:00Z)
- SceneX: Procedural Controllable Large-scale Scene Generation [52.4743878200172]
We introduce SceneX, which can automatically produce high-quality procedural models according to designers' textual descriptions.
The proposed method comprises two components, PCGHub and PCGPlanner.
The latter aims to generate executable actions for Blender to produce controllable and precise 3D assets guided by the user's instructions.
arXiv Detail & Related papers (2024-03-23T03:23:29Z)
- Identifying Systematic Errors in Object Detectors with the SCROD Pipeline [46.52729366461028]
The identification and removal of systematic errors in object detectors can be a prerequisite for their deployment in safety-critical applications.
We overcome this limitation by generating synthetic images with fine-granular control.
We propose a novel framework that combines the strengths of both approaches.
arXiv Detail & Related papers (2023-09-23T22:41:08Z)
- MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation [34.61940502872307]
MultiDiffusion is a unified framework that enables versatile and controllable image generation.
We show that MultiDiffusion can be readily applied to generate high quality and diverse images.
arXiv Detail & Related papers (2023-02-16T06:28:29Z)
- AttrLostGAN: Attribute Controlled Image Synthesis from Reconfigurable Layout and Style [5.912209564607099]
We propose a method for attribute controlled image synthesis from layout.
We extend a state-of-the-art approach for layout-to-image generation to condition individual objects on attributes.
Our results show that our method can successfully control the fine-grained details of individual objects when modelling complex scenes with multiple objects.
arXiv Detail & Related papers (2021-03-25T10:09:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.