LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model

URL: http://arxiv.org/abs/2403.11929v1
Date: Mon, 18 Mar 2024 16:28:28 GMT
Title: LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model
Authors: Runhui Huang, Kaixin Cai, Jianhua Han, Xiaodan Liang, Renjing Pei, Guansong Lu, Songcen Xu, Wei Zhang, Hang Xu
Abstract summary: A layer-collaborative diffusion model, named LayerDiff, is designed for text-guided, multi-layered, composable image synthesis. The model can generate high-quality multi-layered images with performance comparable to conventional whole-image generation methods. LayerDiff enables a broader range of controllable generative applications, including layer-specific image editing and style transfer.
Score: 70.14953942532621
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the success of generating high-quality images given any text prompts by diffusion-based generative models, prior works directly generate the entire images, but cannot provide object-wise manipulation capability. To support wider real applications like professional graphic design and digital artistry, images are frequently created and manipulated in multiple layers to offer greater flexibility and control. Therefore in this paper, we propose a layer-collaborative diffusion model, named LayerDiff, specifically designed for text-guided, multi-layered, composable image synthesis. The composable image consists of a background layer, a set of foreground layers, and associated mask layers for each foreground element. To enable this, LayerDiff introduces a layer-based generation paradigm incorporating multiple layer-collaborative attention modules to capture inter-layer patterns. Specifically, an inter-layer attention module is designed to encourage information exchange and learning between layers, while a text-guided intra-layer attention module incorporates layer-specific prompts to direct the specific-content generation for each layer. A layer-specific prompt-enhanced module better captures detailed textual cues from the global prompt. Additionally, a self-mask guidance sampling strategy further unleashes the model's ability to generate multi-layered images. We also present a pipeline that integrates existing perceptual and generative models to produce a large dataset of high-quality, text-prompted, multi-layered images. Extensive experiments demonstrate that our LayerDiff model can generate high-quality multi-layered images with performance comparable to conventional whole-image generation methods. Moreover, LayerDiff enables a broader range of controllable generative applications, including layer-specific image editing and style transfer.
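The abstract describes a composable image as a background layer, a set of foreground layers, and one mask per foreground element. The sketch below illustrates how such layers could be flattened into a single image. It assumes standard back-to-front alpha blending with masks in [0, 1]; the abstract does not state the exact composition rule, and the function name `composite_layers` and the toy layers are hypothetical, not taken from the paper.

```python
import numpy as np


def composite_layers(background, foregrounds, masks):
    """Blend foreground layers onto a background, back to front.

    Assumed (not from the paper) conventions:
      background:  float array of shape (H, W, 3), values in [0, 1]
      foregrounds: list of float arrays of shape (H, W, 3)
      masks:       list of float arrays of shape (H, W), values in [0, 1],
                   where 1 means the foreground pixel is fully visible
    """
    if len(foregrounds) != len(masks):
        raise ValueError("each foreground layer needs exactly one mask")

    image = background.astype(np.float64)  # astype returns a copy
    for fg, mask in zip(foregrounds, masks):
        alpha = mask[..., None]  # add a channel axis so it broadcasts over RGB
        image = alpha * fg + (1.0 - alpha) * image
    return image


# Toy example: a black background, a red layer covering the top half,
# and a blue layer covering the left half (applied last, so it wins on overlap).
h, w = 4, 4
background = np.zeros((h, w, 3))

red = np.zeros((h, w, 3))
red[..., 0] = 1.0
red_mask = np.zeros((h, w))
red_mask[:2, :] = 1.0

blue = np.zeros((h, w, 3))
blue[..., 2] = 1.0
blue_mask = np.zeros((h, w))
blue_mask[:, :2] = 1.0

result = composite_layers(background, [red, blue], [red_mask, blue_mask])
```

Because each layer keeps its own mask, editing one layer (for example, swapping `red` for a restyled version) only requires re-running the composition, which is the kind of layer-specific editing the abstract mentions.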

Related papers

PSDiffusion: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment [24.964578950380947]
PSDiffusion is a unified diffusion framework for simultaneous multi-layer text-to-image generation. Our model can automatically generate multi-layer images with one RGB background and multiple RGBA foregrounds. Our method introduces a global-layer interactive mechanism that generates layered images concurrently and collaboratively.
arXiv Detail & Related papers (2025-05-16T17:23:35Z)
DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Mode [47.32061459437175]
We introduce DreamLayer, a framework that enables coherent text-driven generation of multiple image layers. By explicitly modeling the relationship between transparent foreground and background layers, DreamLayer builds inter-layer connections. Experiments and user studies demonstrate that DreamLayer generates more coherent and well-aligned layers.
arXiv Detail & Related papers (2025-03-17T05:34:11Z)
ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation [108.69315278353932]
We introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images. By enabling precise control and scalable layer generation, ART establishes a new paradigm for interactive content creation.
arXiv Detail & Related papers (2025-02-25T16:57:04Z)
LayeringDiff: Layered Image Synthesis via Generation, then Disassembly with Generative Knowledge [14.481577976493236]
LayeringDiff is a novel pipeline for the synthesis of layered images. By extracting layers from a composite image, rather than generating them from scratch, LayeringDiff bypasses the need for large-scale training. For effective layer decomposition, we adapt a large-scale pretrained generative prior to estimate foreground and background layers.
arXiv Detail & Related papers (2025-01-02T11:18:25Z)
LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors [38.47462111828742]
Layered content generation is crucial for creative fields like graphic design, animation, and digital art. We propose a novel image generation pipeline based on Latent Diffusion Models (LDMs) that generates images with two layers. We show significant improvements in visual coherence, image quality, and layer consistency compared to baseline methods.
arXiv Detail & Related papers (2024-12-05T18:59:18Z)
Generative Image Layer Decomposition with Visual Effects [49.75021036203426]
LayerDecomp is a generative framework for image layer decomposition. It produces clean backgrounds and high-quality transparent foregrounds with faithfully preserved visual effects. Our method achieves superior quality in layer decomposition, outperforming existing approaches in object removal and spatial editing tasks.
arXiv Detail & Related papers (2024-11-26T20:26:49Z)
MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation [54.64194935409982]
We introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer-wise RGBA decompositions. MuLAn is the first photorealistic resource providing instance decomposition and spatial information for high quality images. We aim to encourage the development of novel generation and editing technology, in particular layer-wise solutions.
arXiv Detail & Related papers (2024-04-03T14:58:00Z)
Text2Layer: Layered Image Generation using Latent Diffusion Model [12.902259486204898]
We propose to generate layered images from a layered image generation perspective. To achieve layered image generation, we train an autoencoder that is able to reconstruct layered images. Experimental results show that the proposed method is able to generate high-quality layered images.
arXiv Detail & Related papers (2023-07-19T06:56:07Z)
Collage Diffusion [17.660410448312717]
Collage Diffusion harmonizes the input layers to make objects fit together. We preserve key visual attributes of input layers by learning specialized text representations per layer. Collage Diffusion generates globally harmonized images that maintain desired object characteristics better than prior approaches.
arXiv Detail & Related papers (2023-03-01T06:35:42Z)
MontageGAN: Generation and Assembly of Multiple Components by GANs [11.117357750374035]
We propose MontageGAN, which is a Generative Adversarial Networks framework for generating multi-layer images. Our method utilized a two-step approach consisting of local GANs and global GAN.
arXiv Detail & Related papers (2022-05-31T07:34:19Z)
SLIDE: Single Image 3D Photography with Soft Layering and Depth-aware Inpainting [54.419266357283966]
Single image 3D photography enables viewers to view a still image from novel viewpoints. Recent approaches combine monocular depth networks with inpainting networks to achieve compelling results. We present SLIDE, a modular and unified system for single image 3D photography.
arXiv Detail & Related papers (2021-09-02T16:37:20Z)
Diversifying Semantic Image Synthesis and Editing via Class- and Layer-wise VAEs [8.528384027684192]
We propose a class- and layer-wise extension to the variational autoencoder framework that allows flexible control over each object class at the local to global levels. We demonstrate that our method generates images that are both plausible and more diverse compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-06-25T04:12:05Z)
TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions. The StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN. The visual-linguistic similarity module learns text-image matching by mapping the image and text into a common embedding space. Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
Deep Image Compositing [93.75358242750752]
We propose a new method which can automatically generate high-quality image composites without any user input. Inspired by Laplacian pyramid blending, a dense-connected multi-stream fusion network is proposed to effectively fuse the information from the foreground and background images. Experiments show that the proposed method can automatically generate high-quality composites and outperforms existing methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2020-11-04T06:12:24Z)
Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match. The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
arXiv Detail & Related papers (2020-07-21T04:03:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.