Move Anything with Layered Scene Diffusion
- URL: http://arxiv.org/abs/2404.07178v1
- Date: Wed, 10 Apr 2024 17:28:16 GMT
- Title: Move Anything with Layered Scene Diffusion
- Authors: Jiawei Ren, Mengmeng Xu, Jui-Chieh Wu, Ziwei Liu, Tao Xiang, Antoine Toisoul
- Abstract summary: We propose SceneDiffusion to optimize a layered scene representation during the diffusion sampling process.
Our key insight is that spatial disentanglement can be obtained by jointly denoising scene renderings at different spatial layouts.
Our generated scenes support a wide range of spatial editing operations, including moving, resizing, cloning, and layer-wise appearance editing.
- Score: 77.45870343845492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models generate images with an unprecedented level of quality, but how can we freely rearrange image layouts? Recent works generate controllable scenes via learning spatially disentangled latent codes, but these methods do not apply to diffusion models due to their fixed forward process. In this work, we propose SceneDiffusion to optimize a layered scene representation during the diffusion sampling process. Our key insight is that spatial disentanglement can be obtained by jointly denoising scene renderings at different spatial layouts. Our generated scenes support a wide range of spatial editing operations, including moving, resizing, cloning, and layer-wise appearance editing operations, including object restyling and replacing. Moreover, a scene can be generated conditioned on a reference image, thus enabling object moving for in-the-wild images. Notably, this approach is training-free, compatible with general text-to-image diffusion models, and responsive in less than a second.
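The core mechanism — jointly denoising renderings of one layered scene at several candidate layouts and routing the noise estimates back onto shared layers — can be pictured with a minimal sketch. This is an illustrative reading of the abstract, not the authors' implementation: `predict_noise`, the single movable layer, and the simple averaging rule are all assumptions.

```python
import numpy as np

# Stand-in for a pretrained text-to-image diffusion noise predictor; the real method
# plugs in a general off-the-shelf model. Returning zeros keeps the sketch runnable.
def predict_noise(x_t, t):
    return np.zeros_like(x_t)

def render(background, layer, mask, offset):
    """Composite one movable layer onto the background at a (dy, dx) offset.
    All arrays are (H, W) single-channel for simplicity."""
    shifted_layer = np.roll(layer, offset, axis=(0, 1))
    shifted_mask = np.roll(mask, offset, axis=(0, 1))
    return shifted_mask * shifted_layer + (1 - shifted_mask) * background

def joint_denoising_step(background, layer, mask, offsets, t):
    """Denoise renderings of the same layered scene at several layouts and average
    the per-layout noise estimates back onto the shared background and layer."""
    bg_eps = np.zeros_like(background)
    layer_eps = np.zeros_like(layer)
    for dy, dx in offsets:
        x_t = render(background, layer, mask, (dy, dx))
        eps = predict_noise(x_t, t)
        shifted_mask = np.roll(mask, (dy, dx), axis=(0, 1))
        # Undo the shift so the layer's estimate lands back in layer coordinates.
        layer_eps += np.roll(shifted_mask * eps, (-dy, -dx), axis=(0, 1))
        bg_eps += (1 - shifted_mask) * eps
    return bg_eps / len(offsets), layer_eps / len(offsets)
```

Because every layout shares the same background and layer, denoising them jointly ties the object's appearance to a representation that can later be moved, resized, or cloned without re-sampling the whole image.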
Related papers
- DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
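As a rough illustration of the attention handling described above, the sketch below blends source- and target-layout attention features during early denoising steps and restricts the blend to the dragged object's region; the schedule, the mask usage, and the function name are assumptions rather than DiffUHaul's exact design.

```python
import numpy as np

def masked_blend_attention(attn_source, attn_target, object_mask, step, num_steps,
                           early_fraction=0.3):
    """Blend source- and target-layout attention features in early denoising steps,
    restricted to the dragged object's region, so the new layout keeps the original
    appearance (illustrative schedule, not the paper's exact one)."""
    if step >= early_fraction * num_steps:
        return attn_target
    w = step / (early_fraction * num_steps)          # ramp from source to target
    blended = (1 - w) * attn_source + w * attn_target
    # Attention masking: only the object's region takes the blended features.
    return object_mask * blended + (1 - object_mask) * attn_target
```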
- Editable Image Elements for Controllable Synthesis [79.58148778509769]
We propose an image representation that promotes spatial editing of input images using a diffusion model.
We show the effectiveness of our representation on various image editing tasks, such as object resizing, rearrangement, dragging, de-occlusion, removal, variation, and image composition.
arXiv Detail & Related papers (2024-04-24T17:59:11Z)
- Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators [19.853978560075305]
Motion guidance is a technique that allows a user to specify dense, complex motion fields that indicate where each pixel in an image should move.
We demonstrate that our technique works on complex motions and produces high quality edits of real and generated images.
arXiv Detail & Related papers (2024-01-31T18:59:59Z)
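A hedged sketch of how such dense motion fields could steer sampling: take the gradient of a loss between a differentiable flow estimator's prediction and the user-specified target flow. `flow_net`, the MSE loss, and the weighting are assumptions, not the paper's exact formulation.

```python
import torch

def motion_guidance_gradient(x_t, x_source, target_flow, flow_net, weight=1.0):
    """Gradient of a loss between the estimated flow (source -> current sample) and a
    user-specified target flow; subtracting it from the sample nudges pixels to move
    as requested (illustrative sketch only)."""
    x_t = x_t.detach().requires_grad_(True)
    predicted_flow = flow_net(x_source, x_t)         # differentiable optical-flow estimate
    loss = weight * torch.nn.functional.mse_loss(predicted_flow, target_flow)
    (grad,) = torch.autograd.grad(loss, x_t)
    return grad
```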
- Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis [60.260724486834164]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries.
We present two key innovations: Vision Guidance and the Layered Rendering Diffusion framework.
We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image and image editing.
arXiv Detail & Related papers (2023-11-30T10:36:19Z)
- Spatial Steerability of GANs via Self-Supervision from Discriminator [123.27117057804732]
We propose a self-supervised approach to improve the spatial steerability of GANs without searching for steerable directions in the latent space.
Specifically, we design randomly sampled Gaussian heatmaps to be encoded into the intermediate layers of generative models as spatial inductive bias.
During inference, users can interact with the spatial heatmaps in an intuitive manner, enabling them to edit the output image by adjusting the scene layout, moving, or removing objects.
arXiv Detail & Related papers (2023-01-20T07:36:29Z)
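The randomly sampled Gaussian heatmaps mentioned above can be pictured as below; only the heatmap construction is shown, and how it is encoded into the generator's intermediate layers is left out, so treat the function and its defaults as assumptions.

```python
import numpy as np

def random_gaussian_heatmap(height, width, sigma=0.1, rng=None):
    """A 2D Gaussian bump at a random location in normalised coordinates, intended to
    be encoded into an intermediate generator feature map (illustrative only)."""
    rng = rng or np.random.default_rng()
    cy, cx = rng.uniform(0.0, 1.0, size=2)           # random centre
    ys, xs = np.mgrid[0:height, 0:width]
    ys, xs = ys / height, xs / width
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
```

At inference time, editing such a heatmap (moving its centre, deleting it) is what lets a user rearrange the corresponding object in the output.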
- BlobGAN: Spatially Disentangled Scene Representations [67.60387150586375]
We propose an unsupervised, mid-level representation for a generative model of scenes.
The representation is mid-level in that it is neither per-pixel nor per-image; rather, scenes are modeled as a collection of spatial, depth-ordered "blobs" of features.
arXiv Detail & Related papers (2022-05-05T17:59:55Z)
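One way to picture the depth-ordered "blobs" is as a back-to-front alpha composite of per-blob opacity maps carrying feature vectors; the parameterisation below is an assumption for illustration, not BlobGAN's actual architecture.

```python
import numpy as np

def composite_blobs(blob_alphas, blob_features):
    """Back-to-front compositing of depth-ordered blobs into one (H, W, C) feature grid.
    blob_alphas: list of (H, W) opacity maps, rear blob first;
    blob_features: list of (C,) feature vectors, one per blob."""
    height, width = blob_alphas[0].shape
    channels = blob_features[0].shape[0]
    canvas = np.zeros((height, width, channels))
    for alpha, feature in zip(blob_alphas, blob_features):
        alpha = alpha[..., None]                     # broadcast opacity over channels
        canvas = alpha * feature + (1 - alpha) * canvas
    return canvas
```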
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.