Generating Compositional Scenes via Text-to-image RGBA Instance Generation
- URL: http://arxiv.org/abs/2411.10913v1
- Date: Sat, 16 Nov 2024 23:44:14 GMT
- Title: Generating Compositional Scenes via Text-to-image RGBA Instance Generation
- Authors: Alessandro Fontanella, Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Sarah Parisot, 
- Abstract summary: Text-to-image diffusion generative models can generate high quality images at the cost of tedious prompt engineering.
We propose a novel multi-stage generation paradigm that is designed for fine-grained control, flexibility and interactivity.
Our experiments show that our RGBA diffusion model is capable of generating diverse and high quality instances with precise control over object attributes.
- Score: 82.63805151691024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image diffusion generative models can generate high quality images at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning; however, existing methods lack layout editing ability and fine-grained control over object attributes. The concept of multi-layer generation holds great potential to address these limitations; however, generating image instances concurrently with scene composition limits control over fine-grained object attributes, relative positioning in 3D space and scene manipulation abilities. In this work, we propose a novel multi-stage generation paradigm that is designed for fine-grained control, flexibility and interactivity. To ensure control over instance attributes, we devise a novel training paradigm to adapt a diffusion model to generate isolated scene components as RGBA images with transparency information. To build complex images, we employ these pre-generated instances and introduce a multi-layer composite generation process that smoothly assembles components in realistic scenes. Our experiments show that our RGBA diffusion model is capable of generating diverse and high quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach makes it possible to build and manipulate images from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods.
 
      
Related papers
- ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation [108.69315278353932]
 We introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images.
By enabling precise control and scalable layer generation, ART establishes a new paradigm for interactive content creation.
 (2025-02-25T16:57:04Z)
- ComposeAnyone: Controllable Layout-to-Human Generation with Decoupled Multimodal Conditions [74.30040551058319]
 ComposeAnyone is a controllable layout-to-human generation method with decoupled multimodal conditions.
Our dataset provides decoupled text and reference image annotations for different components of each human image.
Experiments on multiple datasets demonstrate that ComposeAnyone generates human images with better alignment to given layouts.
 (2025-01-21T14:32:47Z)
- UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation [64.8341372591993]
 We propose a new approach to unify controllable generation within a single framework.
Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture.
Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions.
 (2024-12-25T15:19:02Z)
- OminiControl: Minimal and Universal Control for Diffusion Transformer [68.3243031301164]
 OminiControl is a framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models.
At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone.
OminiControl addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions.
 (2024-11-22T17:55:15Z)
- OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction [32.08995899903304]
 We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization.
Our approach significantly expands the scope of text-to-image generation, and elevates it to a more versatile and practical dimension in controllability.
 (2024-10-07T11:26:13Z)
- AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
 Linguistic control enables effective content creation, but struggles with fine-grained control over image generation.
AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process.
This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
 (2024-06-27T07:40:59Z)
- MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation [54.64194935409982]
 We introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer-wise RGBA decompositions.
MuLAn is the first photorealistic resource providing instance decomposition and spatial information for high quality images.
We aim to encourage the development of novel generation and editing technology, in particular layer-wise solutions.
 (2024-04-03T14:58:00Z)
- SceneX: Procedural Controllable Large-scale Scene Generation [52.4743878200172]
 We introduce SceneX, which can automatically produce high-quality procedural models according to designers' textual descriptions.
The proposed method comprises two components, PCGHub and PCGPlanner.
The latter aims to generate executable actions for Blender to produce controllable and precise 3D assets guided by the user's instructions.
 (2024-03-23T03:23:29Z)
- Consolidating Attention Features for Multi-view Image Editing [126.19731971010475]
 We focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views.
We introduce QNeRF, a neural radiance field trained on the internal query features of the edited images.
We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps.
 (2024-02-22T18:50:18Z)
- Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation [72.6168579583414]
 CompAgent is a training-free approach for compositional text-to-image generation with a large language model (LLM) agent as its core.
Our approach achieves more than 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation.
 (2024-01-28T16:18:39Z)
- Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs [77.86214400258473]
 We propose a new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG).
RPG harnesses the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models.
Our framework exhibits wide compatibility with various MLLM architectures.
 (2024-01-22T06:16:29Z)
- Identifying Systematic Errors in Object Detectors with the SCROD Pipeline [46.52729366461028]
 The identification and removal of systematic errors in object detectors can be a prerequisite for their deployment in safety-critical applications.
We overcome this limitation by generating synthetic images with fine-granular control.
We propose a novel framework that combines the strengths of both approaches.
 (2023-09-23T22:41:08Z)
- MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation [34.61940502872307]
 MultiDiffusion is a unified framework that enables versatile and controllable image generation.
We show that MultiDiffusion can be readily applied to generate high quality and diverse images.
 (2023-02-16T06:28:29Z)
- AttrLostGAN: Attribute Controlled Image Synthesis from Reconfigurable Layout and Style [5.912209564607099]
 We propose a method for attribute controlled image synthesis from layout.
We extend a state-of-the-art approach for layout-to-image generation to condition individual objects on attributes.
Our results show that our method can successfully control the fine-grained details of individual objects when modelling complex scenes with multiple objects.
 (2021-03-25T10:09:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     