MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image
Synthesis and Editing
- URL: http://arxiv.org/abs/2304.08465v1
- Date: Mon, 17 Apr 2023 17:42:19 GMT
- Authors: Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie,
Yinqiang Zheng
- Abstract summary: We develop MasaCtrl, a tuning-free method to achieve consistent image generation and complex non-rigid image editing simultaneously.
Specifically, MasaCtrl converts existing self-attention in diffusion models into mutual self-attention, so that it can query correlated local contents and textures from source images for consistency.
Extensive experiments show that the proposed MasaCtrl can produce impressive results in both consistent image generation and complex non-rigid real image editing.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the success in large-scale text-to-image generation and
text-conditioned image editing, existing methods still struggle to produce
consistent generation and editing results. For example, generation approaches
usually fail to synthesize multiple images of the same objects/characters but
with different views or poses. Meanwhile, existing editing methods either fail
to achieve effective complex non-rigid editing while maintaining the overall
textures and identity, or require time-consuming fine-tuning to capture the
image-specific appearance. In this paper, we develop MasaCtrl, a tuning-free
method to achieve consistent image generation and complex non-rigid image
editing simultaneously. Specifically, MasaCtrl converts existing self-attention
in diffusion models into mutual self-attention, so that it can query correlated
local contents and textures from source images for consistency. To further
alleviate the query confusion between foreground and background, we propose a
mask-guided mutual self-attention strategy, where the mask can be easily
extracted from the cross-attention maps. Extensive experiments show that the
proposed MasaCtrl can produce impressive results in both consistent image
generation and complex non-rigid real image editing.
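The mutual self-attention idea described in the abstract can be illustrated with a minimal, dependency-free sketch. It assumes that "mutual" means the target image's queries attend to keys and values taken from the source image, and that the mask-guided variant restricts foreground queries to source foreground and background queries to source background. The function names (`attention`, `mutual_self_attention`, `masked_mutual_self_attention`), the boolean-mask representation, and the blending rule are illustrative assumptions, not the authors' implementation; the abstract does not specify these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax; -inf entries receive zero weight."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, allowed=None):
    """Scaled dot-product attention over lists of feature vectors.

    allowed[j] is True when key j may be attended to (None = all keys).
    """
    if allowed is not None and not any(allowed):
        raise ValueError("at least one key must be allowed")
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        if allowed is not None:
            scores = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def mutual_self_attention(target_q, source_k, source_v):
    """Assumed form: target queries read content from source keys/values."""
    return attention(target_q, source_k, source_v)


def masked_mutual_self_attention(target_q, source_k, source_v,
                                 source_fg, target_fg):
    """Hypothetical mask-guided variant.

    Foreground target positions attend only to source foreground, and
    background positions only to source background. Assumes both regions
    of the source mask are non-empty.
    """
    source_bg = [not m for m in source_fg]
    fg_out = attention(target_q, source_k, source_v, allowed=source_fg)
    bg_out = attention(target_q, source_k, source_v, allowed=source_bg)
    return [f if is_fg else b
            for f, b, is_fg in zip(fg_out, bg_out, target_fg)]
```

In a real diffusion model the queries, keys, and values would be per-layer feature tensors from the U-Net, and the masks would be derived from cross-attention maps as the abstract mentions; plain lists are used here only to keep the sketch self-contained.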
Related papers
- Consolidating Attention Features for Multi-view Image Editing [126.19731971010475]
We focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views.
We introduce QNeRF, a neural radiance field trained on the internal query features of the edited images.
We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps.
(2024-02-22T18:50:18Z)
- DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing [66.43179841884098]
Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years.
We propose DiffEditor to rectify two weaknesses in existing diffusion-based image editing.
Our method can efficiently achieve state-of-the-art performance on various fine-grained image editing tasks.
(2024-02-04T18:50:29Z)
- DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models [66.43179841884098]
We propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models.
Our method achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging.
(2023-07-05T16:43:56Z)
- LayerDiffusion: Layered Controlled Image Editing with Diffusion Models [5.58892860792971]
LayerDiffusion is a semantic-based layered controlled image editing method.
We leverage a large-scale text-to-image model and employ a layered controlled optimization strategy.
Experimental results demonstrate the effectiveness of our method in generating highly coherent images.
(2023-05-30T01:26:41Z)
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It compares favourably with its counterparts in image fidelity and CLIP alignment score, and qualitatively when editing both generated and real images.
(2023-05-10T07:39:14Z)
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited.
(2022-10-20T17:16:37Z)
- UniTune: Text-Driven Image Editing by Fine Tuning a Diffusion Model on a Single Image [2.999198565272416]
We make the observation that image-generation models can be converted to image-editing models simply by fine-tuning them on a single image.
We propose UniTune, a novel image editing method. UniTune takes an arbitrary image and a textual edit description as input, and carries out the edit while maintaining high fidelity to the input image.
We demonstrate that it is broadly applicable and can perform a surprisingly wide range of expressive editing operations, including those requiring significant visual changes that were previously impossible.
(2022-10-17T23:46:05Z)
- Prompt-to-Prompt Image Editing with Cross Attention Control [41.26939787978142]
We present an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only.
We show our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
(2022-08-02T17:55:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.