Shape-Guided Diffusion with Inside-Outside Attention
- URL: http://arxiv.org/abs/2212.00210v3
- Date: Mon, 1 Apr 2024 17:19:02 GMT
- Title: Shape-Guided Diffusion with Inside-Outside Attention
- Authors: Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, Trevor Darrell
- Abstract summary: We introduce precise object silhouette as a new form of user control in text-to-image diffusion models.
Our training-free method uses an Inside-Outside Attention mechanism to apply a shape constraint to the cross- and self-attention maps.
- Score: 60.557437251084465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce precise object silhouette as a new form of user control in text-to-image diffusion models, which we dub Shape-Guided Diffusion. Our training-free method uses an Inside-Outside Attention mechanism during the inversion and generation process to apply a shape constraint to the cross- and self-attention maps. Our mechanism designates which spatial region is the object (inside) vs. background (outside) then associates edits to the correct region. We demonstrate the efficacy of our method on the shape-guided editing task, where the model must replace an object according to a text prompt and object mask. We curate a new ShapePrompts benchmark derived from MS-COCO and achieve SOTA results in shape faithfulness without a degradation in text alignment or image realism according to both automatic metrics and annotator ratings. Our data and code will be made available at https://shape-guided-diffusion.github.io.
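The abstract's core mechanism is a masked attention constraint: pixels inside the object silhouette should be driven by the object (prompt) tokens, while pixels outside should be driven by background tokens, with an analogous constraint on the self-attention maps during inversion and generation. Below is a minimal sketch of how such a constraint could be applied to a cross-attention map, assuming PyTorch, a (batch, pixels, tokens) attention layout, and a hypothetical split of prompt-token indices into object vs. background groups; the function name, renormalization step, and toy shapes are illustrative choices, not the authors' released code.
```python
# Minimal sketch (not the paper's implementation) of an inside/outside
# constraint on a cross-attention map. Assumes a (batch, pixels, tokens)
# layout and a user-supplied split of prompt-token indices.
import torch


def inside_outside_cross_attention(attn, mask, object_token_ids, background_token_ids):
    """Zero out attention that crosses the object silhouette, then renormalize.

    attn: (B, P, T) cross-attention weights (each pixel's row sums to 1 over tokens)
    mask: (P,) binary object mask, 1 = inside (object), 0 = outside (background)
    object_token_ids / background_token_ids: 1-D LongTensors of prompt-token indices
    """
    B, P, T = attn.shape
    device = attn.device

    inside = mask.bool().view(1, P, 1)  # which pixels lie inside the object

    is_obj = torch.zeros(T, dtype=torch.bool, device=device)
    is_obj[object_token_ids] = True
    is_bg = torch.zeros(T, dtype=torch.bool, device=device)
    is_bg[background_token_ids] = True
    is_obj, is_bg = is_obj.view(1, 1, T), is_bg.view(1, 1, T)

    # Keep: object tokens <-> inside pixels, background tokens <-> outside pixels;
    # tokens assigned to neither group may attend everywhere.
    keep = (inside & is_obj) | (~inside & is_bg) | (~is_obj & ~is_bg)

    attn = attn * keep
    return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)


# Toy usage: 1 image, 64 pixels (8x8 latent grid), 6 prompt tokens.
attn = torch.rand(1, 64, 6).softmax(dim=-1)
mask = torch.zeros(64)
mask[20:44] = 1  # hypothetical object region
out = inside_outside_cross_attention(
    attn, mask,
    object_token_ids=torch.tensor([1, 2]),
    background_token_ids=torch.tensor([3, 4]),
)
```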
Related papers
- MFTF: Mask-free Training-free Object Level Layout Control Diffusion Model [11.699591936909325]
The Mask-free Training-free Object-Level Layout Control Diffusion Model (MFTF) provides precise control over object positions without requiring additional masks or images.
arXiv Detail & Related papers (2024-12-02T08:56:13Z) - DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
arXiv Detail & Related papers (2024-06-03T17:59:53Z) - DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models [66.43179841884098]
We propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models.
Our method achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging.
arXiv Detail & Related papers (2023-07-05T16:43:56Z) - Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis [59.10787643285506]
Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks.
One critical limitation of these models is the low fidelity of generated images with respect to the text description.
We propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models.
arXiv Detail & Related papers (2023-04-07T23:49:34Z) - FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method for real-world videos that requires no per-prompt training or use-specific mask.
Our method is the first to demonstrate zero-shot text-driven video style and local attribute editing from a pretrained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z) - Learning 3D Photography Videos via Self-supervised Diffusion on Single Images [105.81348348510551]
3D photography renders a static image into a video with appealing 3D visual effects.
Existing approaches typically first conduct monocular depth estimation, then render the input frame to subsequent frames with various viewpoints.
We present a novel task: out-animation, which extends the space and time of input objects.
arXiv Detail & Related papers (2023-02-21T16:18:40Z) - Intuitive Shape Editing in Latent Space [9.034665429931406]
We present an autoencoder-based method that enables intuitive shape editing in latent space by disentangling latent sub-spaces.
We evaluate our method by comparing it to state-of-the-art data-driven shape editing methods.
arXiv Detail & Related papers (2021-11-24T13:33:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.