Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis
- URL: http://arxiv.org/abs/2311.18435v2
- Date: Fri, 29 Nov 2024 04:23:28 GMT
- Title: Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis
- Authors: Zipeng Qi, Guoxi Huang, Chenyang Liu, Fei Ye, et al.
- Abstract summary: This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries.
We first introduce vision guidance as a foundational spatial cue within the perturbed distribution.
We propose a universal framework, Layered Rendering Diffusion (LRDiff), which constructs an image-rendering process with multiple layers.
- Score: 15.76266032768078
- Abstract: This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries. We first introduce vision guidance as a foundational spatial cue within the perturbed distribution. This significantly refines the search space in a zero-shot paradigm to focus on the image sampling process adhering to the spatial layout conditions. To precisely control the spatial layouts of multiple visual concepts using vision guidance, we propose a universal framework, Layered Rendering Diffusion (LRDiff), which constructs an image-rendering process with multiple layers, each of which applies the vision guidance to instructively estimate the denoising direction for a single object. Such a layered rendering strategy effectively prevents issues like unintended conceptual blending or mismatches while allowing for more coherent and contextually accurate image synthesis. The proposed method offers a more efficient and accurate means of synthesising images that align with specific layout and contextual requirements. Through experiments, we demonstrate that our method outperforms existing techniques, both quantitatively and qualitatively, in two specific layout-to-image tasks: bounding box-to-image and instance mask-to-image. Furthermore, we extend the proposed framework to enable spatially controllable editing.
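The listing contains no code, so the following is only a minimal, assumption-laden sketch of the layered idea described in the abstract: a base denoising direction is estimated from the scene prompt, and each layer re-estimates the direction for a single object and applies it only inside that object's layout mask. The function name `layered_denoise_step`, the mask-weighted composition rule, and the `guidance_scale` parameter are hypothetical and are not taken from the paper.

```python
def layered_denoise_step(latent, t, text_embs, masks, unet, guidance_scale=1.5):
    """One denoising step with per-layer spatial guidance (illustrative sketch).

    latent:     current noisy latent x_t, shape (1, C, H, W)
    text_embs:  [scene_prompt_emb, object_1_emb, object_2_emb, ...]
    masks:      binary layout masks (box or instance mask), one per object,
                shape (1, 1, H, W), aligned with `latent`
    unet:       a noise-prediction network eps_theta(x_t, t, cond)
    """
    # Base (background) denoising direction from the full scene prompt.
    eps = unet(latent, t, text_embs[0])

    # Each layer estimates the denoising direction for a single object and
    # overrides the base prediction only inside that object's region, so
    # concepts stay spatially separated instead of blending together.
    for emb, mask in zip(text_embs[1:], masks):
        eps_obj = unet(latent, t, emb)
        eps = (1 - mask) * eps + mask * (eps + guidance_scale * (eps_obj - eps))

    return eps  # plug into the usual DDPM/DDIM update to obtain x_{t-1}
```

Note that the paper itself introduces vision guidance within the perturbed distribution rather than only reweighting the predicted noise, so this sketch simplifies that aspect and should not be read as the authors' implementation.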
Related papers
- Training-free Composite Scene Generation for Layout-to-Image Synthesis [29.186425845897947]
This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase.
We propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships.
Our evaluations confirm the effectiveness of leveraging layout information for guiding the diffusion process, generating content-rich images with enhanced fidelity and complexity.
arXiv Detail & Related papers (2024-07-18T15:48:07Z) - FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior [50.0535198082903]
We offer a novel approach to image composition, which integrates multiple input images into a single, coherent image.
We showcase the potential of utilizing the powerful generative prior inherent in large-scale pre-trained diffusion models to accomplish generic image composition.
arXiv Detail & Related papers (2024-07-06T03:35:43Z) - Prompt Optimizer of Text-to-Image Diffusion Models for Abstract Concept Understanding [9.787025432074978]
This paper introduces Prompt Optimizer for Abstract Concepts (POAC) to enhance the performance of text-to-image diffusion models.
We propose a Prompt Language Model (PLM), which is curated from a pre-trained language model, and then fine-tuned with a dataset of abstract concept prompts.
Our framework employs a Reinforcement Learning (RL)-based optimization strategy, focusing on the alignment between the generated images by a stable diffusion model and optimized prompts.
arXiv Detail & Related papers (2024-04-17T17:38:56Z) - LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis [24.925757148750684]
We propose a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions.
LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods.
arXiv Detail & Related papers (2023-11-21T04:28:12Z) - Single Stage Virtual Try-on via Deformable Attention Flows [51.70606454288168]
Virtual try-on aims to generate a photo-realistic fitting result given an in-shop garment and a reference person image.
We develop a novel Deformable Attention Flow (DAFlow) which applies the deformable attention scheme to multi-flow estimation.
Our proposed method achieves state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-07-19T10:01:31Z) - Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - Semantic Layout Manipulation with High-Resolution Sparse Attention [106.59650698907953]
We tackle the problem of semantic image layout manipulation, which aims to manipulate an input image by editing its semantic label map.
A core problem of this task is how to transfer visual details from the input images to the new semantic layout while making the resulting image visually realistic.
We propose a high-resolution sparse attention module that effectively transfers visual details to new layouts at a resolution up to 512x512.
arXiv Detail & Related papers (2020-12-14T06:50:43Z) - Style Intervention: How to Achieve Spatial Disentanglement with Style-based Generators? [100.60938767993088]
We propose a lightweight optimization-based algorithm which could adapt to arbitrary input images and render natural translation effects under flexible objectives.
We verify the performance of the proposed framework in facial attribute editing on high-resolution images, where both photo-realism and consistency are required.
arXiv Detail & Related papers (2020-11-19T07:37:31Z) - TSIT: A Simple and Versatile Framework for Image-to-Image Translation [103.92203013154403]
We introduce a simple and versatile framework for image-to-image translation.
We provide a carefully designed two-stream generative model with newly proposed feature transformations.
This allows multi-scale semantic structure information and style representation to be effectively captured and fused by the network.
A systematic study compares the proposed method with several state-of-the-art task-specific baselines, verifying its effectiveness in both perceptual quality and quantitative evaluations.
arXiv Detail & Related papers (2020-07-23T15:34:06Z)