Related papers: DiffUHaul: A Training-Free Method for Object Dragging in Images

URL: http://arxiv.org/abs/2406.01594v2
Date: Sun, 8 Sep 2024 06:31:22 GMT
Title: DiffUHaul: A Training-Free Method for Object Dragging in Images
Authors: Omri Avrahami, Rinon Gal, Gal Chechik, Ohad Fried, Dani Lischinski, Arash Vahdat, Weili Nie,
Abstract summary: We propose a training-free method, dubbed DiffUHaul, for the object dragging task. We first apply attention masking in each denoising step to make the generation more disentangled across different objects. In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
Score: 78.93531472479202
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Text-to-image diffusion models have proven effective for solving many image editing tasks. However, the seemingly straightforward task of seamlessly relocating objects within a scene remains surprisingly challenging. Existing methods addressing this problem often struggle to function reliably in real-world scenarios due to lacking spatial reasoning. In this work, we propose a training-free method, dubbed DiffUHaul, that harnesses the spatial understanding of a localized text-to-image model, for the object dragging task. Blindly manipulating layout inputs of the localized model tends to cause low editing performance due to the intrinsic entanglement of object representation in the model. To this end, we first apply attention masking in each denoising step to make the generation more disentangled across different objects and adopt the self-attention sharing mechanism to preserve the high-level object appearance. Furthermore, we propose a new diffusion anchoring technique: in the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance; in the later denoising steps, we pass the localized features from the source images to the interpolated images to retain fine-grained object details. To adapt DiffUHaul to real-image editing, we apply a DDPM self-attention bucketing that can better reconstruct real images with the localized model. Finally, we introduce an automated evaluation pipeline for this task and showcase the efficacy of our method. Our results are reinforced through a user preference study.
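The early-step part of the diffusion anchoring idea in the abstract (interpolating attention features between source and target) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the linear weight schedule, the direction of the interpolation, the `early_fraction` threshold, and the representation of features as flat lists of floats. The late-step passing of localized source features is not modeled; the sketch simply returns the target features there.

```python
def lerp(a, b, w):
    """Element-wise linear interpolation: (1 - w) * a + w * b."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def anchored_features(src, tgt, step, num_steps, early_fraction=0.5):
    """Hypothetical anchoring schedule (assumed, not the authors' method).

    In the early steps, blend source and target attention features with a
    weight that moves linearly from 0 (pure source) to 1 (pure target).
    In the later steps, return the target features unchanged.
    """
    early_steps = int(num_steps * early_fraction)
    if step < early_steps:
        # Weight grows from 0.0 at the first step to 1.0 at the last early step.
        w = step / max(early_steps - 1, 1)
        return lerp(src, tgt, w)
    return list(tgt)


if __name__ == "__main__":
    source = [0.0, 2.0]
    target = [1.0, 4.0]
    for t in range(10):
        print(t, anchored_features(source, target, t, num_steps=10))
```

A real implementation would apply the blend to per-layer attention tensors inside the denoising loop; the flat lists here only make the schedule visible.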

Related papers

Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models [78.90023746996302]
Add-it is a training-free approach that extends diffusion models' attention mechanisms to incorporate information from three key sources. Our weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement. Human evaluations show that Add-it is preferred in over 80% of cases.
arXiv Detail & Related papers (2024-11-11T18:50:09Z)
InstructRL4Pix: Training Diffusion for Image Editing by Reinforcement Learning [31.799923647356458]
We propose a Reinforcement Learning Guided Image Editing Method (InstructRL4Pix) to train a diffusion model to generate images that are guided by the attention maps of the target object. Experimental results show that InstructRL4Pix breaks through the limitations of traditional datasets and uses unsupervised learning to optimize editing goals and achieve accurate image editing based on natural human commands.
arXiv Detail & Related papers (2024-06-14T12:31:48Z)
Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing [47.71851180196975]
Tuning-free Text-guided Image Editing (TIE) is of greater importance for application developers. We conduct an in-depth probing analysis and demonstrate that cross-attention maps in Stable Diffusion often contain object attribution information. In contrast, self-attention maps play a crucial role in preserving the geometric and shape details of the source image.
arXiv Detail & Related papers (2024-03-06T03:32:56Z)
Localizing Object-level Shape Variations with Text-to-Image Diffusion Models [60.422435066544814]
We present a technique to generate a collection of images that depicts variations in the shape of a specific object. A particular challenge when generating object variations is accurately localizing the manipulation applied over the object's shape. To localize the image-space operation, we present two techniques that use the self-attention layers in conjunction with the cross-attention layers.
arXiv Detail & Related papers (2023-03-20T17:45:08Z)
Blended Diffusion for Text-driven Editing of Natural Images [18.664733153082146]
We introduce the first solution for performing local (region-based) edits in generic natural images. We achieve our goal by leveraging and combining a pretrained language-image model (CLIP). To seamlessly fuse the edited region with the unchanged parts of the image, we spatially blend noised versions of the input image with the local text-guided diffusion latent.
arXiv Detail & Related papers (2021-11-29T18:58:49Z)
Instance Localization for Self-supervised Detection Pretraining [68.24102560821623]
We propose a new self-supervised pretext task, called instance localization. We show that integration of bounding boxes into pretraining promotes better task alignment and architecture alignment for transfer learning. Experimental results demonstrate that our approach yields state-of-the-art transfer learning results for object detection.
arXiv Detail & Related papers (2021-02-16T17:58:57Z)
Distilling Localization for Self-Supervised Representation Learning [82.79808902674282]
Contrastive learning has revolutionized unsupervised representation learning. Current contrastive models are ineffective at localizing the foreground object. We propose a data-driven approach for learning invariance to backgrounds.
arXiv Detail & Related papers (2020-04-14T16:29:42Z)
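
The spatial blending summarized in the Blended Diffusion entry above (combining a noised version of the input image with the text-guided latent under a region mask) can be sketched as a per-element mask blend. This is a minimal illustration under stated assumptions: the function name `blend_with_mask`, the flat-list representation, and the convention that a mask value of 1.0 marks the edited region are all hypothetical and not taken from that paper.

```python
def blend_with_mask(noised_input, guided_latent, mask):
    """Keep the guided latent where mask is 1 and the noised input where it is 0.

    All three arguments are equal-length lists of floats; mask values are
    assumed to lie in [0, 1].
    """
    return [
        m * g + (1.0 - m) * x
        for x, g, m in zip(noised_input, guided_latent, mask)
    ]


if __name__ == "__main__":
    background = [1.0, 1.0, 1.0]
    edited = [5.0, 5.0, 5.0]
    region = [1.0, 0.0, 0.5]
    print(blend_with_mask(background, edited, region))
```

In a diffusion sampler, a blend of this kind would be repeated at each denoising step so that the content outside the mask stays tied to the input image.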

This list is automatically generated from the titles and abstracts of the papers in this site.