Generating Compositional Scenes via Text-to-image RGBA Instance Generation
- URL: http://arxiv.org/abs/2411.10913v1
- Date: Sat, 16 Nov 2024 23:44:14 GMT
- Title: Generating Compositional Scenes via Text-to-image RGBA Instance Generation
- Authors: Alessandro Fontanella, Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Sarah Parisot
- Abstract summary: Text-to-image diffusion generative models can generate high-quality images at the cost of tedious prompt engineering.
We propose a novel multi-stage generation paradigm designed for fine-grained control, flexibility, and interactivity.
Our experiments show that our RGBA diffusion model is capable of generating diverse, high-quality instances with precise control over object attributes.
- Score: 82.63805151691024
- License:
- Abstract: Text-to-image diffusion generative models can generate high-quality images at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning; however, existing methods lack layout editing ability and fine-grained control over object attributes. The concept of multi-layer generation holds great potential to address these limitations; however, generating image instances concurrently with scene composition limits control over fine-grained object attributes, relative positioning in 3D space, and scene manipulation. In this work, we propose a novel multi-stage generation paradigm designed for fine-grained control, flexibility, and interactivity. To ensure control over instance attributes, we devise a novel training paradigm that adapts a diffusion model to generate isolated scene components as RGBA images with transparency information. To build complex images, we employ these pre-generated instances and introduce a multi-layer composite generation process that smoothly assembles components into realistic scenes. Our experiments show that our RGBA diffusion model is capable of generating diverse and high-quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach allows us to build and manipulate images from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods.
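To make the two-stage idea concrete: the paper first generates isolated instances as RGBA images with alpha transparency, then composes them layer by layer into a scene. The sketch below is for intuition only; it shows the naive baseline of pasting pre-generated RGBA layers back-to-front with standard "over" alpha compositing, assuming Pillow and a hypothetical generate_rgba_instance placeholder standing in for the paper's RGBA diffusion model (whose API is not described here).

```python
from PIL import Image

def generate_rgba_instance(prompt: str, size=(512, 512)) -> Image.Image:
    """Hypothetical stand-in for the RGBA instance diffusion model (not the paper's API)."""
    raise NotImplementedError("plug in an RGBA instance generator here")

def compose_scene(background, layers):
    """Stack RGBA instances onto a background, back to front, with standard 'over' blending.

    layers: list of (rgba_instance, (x, y)) tuples, ordered from back to front.
    """
    canvas = background.convert("RGBA")
    for instance, (x, y) in layers:
        # Place the instance on a transparent, canvas-sized layer at its offset.
        layer = Image.new("RGBA", canvas.size, (0, 0, 0, 0))
        layer.paste(instance, (x, y), mask=instance)   # instance's alpha channel as paste mask
        # Porter-Duff "over" compositing of the layer onto the current canvas.
        canvas = Image.alpha_composite(canvas, layer)
    return canvas

# Usage sketch (instances would come from the RGBA diffusion model):
# scene = compose_scene(Image.open("background.png"),
#                       [(sofa_rgba, (60, 220)), (cat_rgba, (180, 300))])
```

Note that the paper's multi-layer composite generation goes beyond this hard paste: it is a diffusion-based process that assembles the pre-generated components smoothly into realistic scenes, rather than leaving them looking pasted on, while their appearance and layout remain under user control.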
Related papers
- OminiControl: Minimal and Universal Control for Diffusion Transformer [68.3243031301164]
OminiControl is a framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models.
At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone.
OminiControl addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions.
arXiv Detail & Related papers (2024-11-22T17:55:15Z) - OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction [32.08995899903304]
We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization.
Our approach significantly expands the scope of text-to-image generation, elevating it to a more versatile and practical level of controllability.
arXiv Detail & Related papers (2024-10-07T11:26:13Z) - AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation.
AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process.
This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z) - MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation [54.64194935409982]
We introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer-wise RGBA decompositions.
MuLAn is the first photorealistic resource providing instance decomposition and spatial information for high quality images.
We aim to encourage the development of novel generation and editing technology, in particular layer-wise solutions.
arXiv Detail & Related papers (2024-04-03T14:58:00Z) - Consolidating Attention Features for Multi-view Image Editing [126.19731971010475]
We focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views.
We introduce QNeRF, a neural radiance field trained on the internal query features of the edited images.
We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps.
arXiv Detail & Related papers (2024-02-22T18:50:18Z) - Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation [72.6168579583414]
CompAgent is a training-free approach for compositional text-to-image generation with a large language model (LLM) agent as its core.
Our approach achieves more than 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation.
arXiv Detail & Related papers (2024-01-28T16:18:39Z) - Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs [77.86214400258473]
We propose a new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG).
RPG harnesses the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models.
Our framework exhibits wide compatibility with various MLLM architectures.
arXiv Detail & Related papers (2024-01-22T06:16:29Z) - Identifying Systematic Errors in Object Detectors with the SCROD Pipeline [46.52729366461028]
The identification and removal of systematic errors in object detectors can be a prerequisite for their deployment in safety-critical applications.
We overcome the limitations of existing approaches by generating synthetic images with fine-granular control.
We propose a novel framework that combines the strengths of both approaches.
arXiv Detail & Related papers (2023-09-23T22:41:08Z) - MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation [34.61940502872307]
MultiDiffusion is a unified framework that enables versatile and controllable image generation.
We show that MultiDiffusion can be readily applied to generate high quality and diverse images.
arXiv Detail & Related papers (2023-02-16T06:28:29Z) - AttrLostGAN: Attribute Controlled Image Synthesis from Reconfigurable Layout and Style [5.912209564607099]
We propose a method for attribute controlled image synthesis from layout.
We extend a state-of-the-art approach for layout-to-image generation to condition individual objects on attributes.
Our results show that our method can successfully control the fine-grained details of individual objects when modelling complex scenes with multiple objects.
arXiv Detail & Related papers (2021-03-25T10:09:45Z)