Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks
- URL: http://arxiv.org/abs/2401.07709v2
- Date: Tue, 23 Jan 2024 11:22:03 GMT
- Title: Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks
- Authors: Siyu Zou, Jiji Tang, Yiyi Zhou, Jing He, Chaoyi Zhao, Rongsheng Zhang,
Zhipeng Hu, Xiaoshuai Sun
- Abstract summary: In this paper, we propose a novel and efficient image editing method for Text-to-Image (T2I) diffusion models, termed Instant Diffusion Editing (InstDiffEdit).
In particular, InstDiffEdit aims to employ the cross-modal attention ability of existing diffusion models to achieve instant mask guidance during the diffusion steps.
To supplement the existing evaluations of Diffusion-based Image Editing (DIE), we propose a new benchmark called Editing-Mask to examine the mask accuracy and local editing ability of existing methods.
- Score: 43.079272743475435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion-based Image Editing (DIE) is an emerging research hotspot, which
often applies a semantic mask to control the target area for diffusion-based
editing. However, most existing solutions obtain these masks via manual
operations or off-line processing, greatly reducing their efficiency. In this
paper, we propose a novel and efficient image editing method for Text-to-Image
(T2I) diffusion models, termed Instant Diffusion Editing (InstDiffEdit). In
particular, InstDiffEdit aims to employ the cross-modal attention ability of
existing diffusion models to achieve instant mask guidance during the diffusion
steps. To reduce the noise in the attention maps and achieve full automation,
we equip InstDiffEdit with a training-free refinement scheme to adaptively
aggregate the attention distributions for the automatic yet accurate mask
generation. Meanwhile, to supplement the existing evaluations of DIE, we
propose a new benchmark called Editing-Mask to examine the mask accuracy and
local editing ability of existing methods. To validate InstDiffEdit, we also
conduct extensive experiments on ImageNet and Imagen, and compare it with a
set of SOTA methods. The experimental results show that InstDiffEdit not only
outperforms the SOTA methods in both image quality and editing results, but
also achieves a much faster inference speed, i.e., 5 to 6 times faster.
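The core mechanism, deriving an editing mask directly from cross-attention, lends itself to a compact illustration. The following is a minimal sketch, not the authors' implementation: it assumes cross-attention probabilities have already been collected (e.g., via a forward hook on a Stable Diffusion UNet), and the head averaging and quantile threshold are illustrative stand-ins for the paper's adaptive refinement scheme.

```python
import torch

def instant_attention_mask(cross_attn, target_token_idx, keep_ratio=0.3):
    """Derive a binary editing mask from one step's cross-attention maps.

    cross_attn:       (heads, H*W, num_tokens) attention probabilities,
                      e.g. captured with a forward hook on a UNet block.
    target_token_idx: index of the prompt token naming the edit target.
    keep_ratio:       fraction of pixels marked editable (illustrative).
    """
    attn = cross_attn.mean(dim=0)                 # average heads -> (H*W, tokens)
    target = attn[:, target_token_idx]            # target-token map -> (H*W,)
    # Normalize to [0, 1] so the threshold is scale-free
    target = (target - target.min()) / (target.max() - target.min() + 1e-8)
    thresh = torch.quantile(target, 1.0 - keep_ratio)  # adaptive cut-off
    mask = (target >= thresh).float()
    side = int(mask.numel() ** 0.5)               # assume a square latent grid
    return mask.reshape(side, side)               # binary mask at latent resolution
```

In the paper's pipeline the mask is produced and refined inside the diffusion loop and used to restrict edits to the target region; here it is extracted once for clarity.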
Related papers
- Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion [61.42732844499658]
This paper systematically improves text-guided image editing techniques based on diffusion models.
We incorporate human annotation as external knowledge to confine editing within a "Mask-informed" region.
arXiv Detail & Related papers (2024-05-24T07:53:59Z) - Streamlining Image Editing with Layered Diffusion Brushes [8.738398948669609]
Our system renders a single edit on a 512x512 image within 140 ms using a high-end consumer GPU.
Our approach demonstrates efficacy across a range of tasks, including object attribute adjustments, error correction, and sequential prompt-based object placement and manipulation.
arXiv Detail & Related papers (2024-05-01T04:30:03Z) - Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing [2.5602836891933074]
A commonly adopted strategy for editing real images involves inverting the diffusion process to obtain a noisy representation of the original image.
Current methods for diffusion inversion often struggle to produce edits that are both faithful to the specified text prompt and closely resemble the source image.
We introduce a novel and adaptable diffusion inversion technique for real image editing, which is grounded in a theoretical analysis of the role of $\eta$ in the DDIM sampling equation (recalled below) for enhanced editability.
arXiv Detail & Related papers (2024-03-14T15:07:36Z) - DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image
Editing [66.43179841884098]
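For reference, the quantity being analyzed is the $\sigma_t$ term of the standard DDIM update (Song et al., with $\alpha_t$ denoting the cumulative noise schedule); this is the textbook equation, not a result specific to the entry above:

```latex
% DDIM sampling step (Song et al.); \alpha_t is the cumulative noise schedule
x_{t-1} = \sqrt{\alpha_{t-1}}
    \underbrace{\left( \frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(x_t, t)}
                            {\sqrt{\alpha_t}} \right)}_{\text{predicted } x_0}
  + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\;\epsilon_\theta(x_t, t)
  + \sigma_t \varepsilon_t ,
\qquad \varepsilon_t \sim \mathcal{N}(0, I),
\qquad
\sigma_t(\eta) = \eta \sqrt{\frac{1-\alpha_{t-1}}{1-\alpha_t}}
                      \sqrt{1-\frac{\alpha_t}{\alpha_{t-1}}} .
```

Setting $\eta = 0$ gives the deterministic DDIM update commonly used for inversion, while $\eta = 1$ recovers DDPM-like stochastic sampling; the entry above studies how the choice of $\eta$ trades editability against fidelity to the source image.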
- DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing [66.43179841884098]
Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years.
We propose DiffEditor to rectify two weaknesses in existing diffusion-based image editing.
Our method can efficiently achieve state-of-the-art performance on various fine-grained image editing tasks.
arXiv Detail & Related papers (2024-02-04T18:50:29Z) - LIME: Localized Image Editing via Attention Regularization in Diffusion
Models [74.3811832586391]
This paper introduces LIME, a method for localized image editing in diffusion models that requires neither user-specified regions of interest (RoI) nor additional text input.
Our method employs features from pre-trained methods and a simple clustering technique to obtain precise semantic segmentation maps.
We propose a novel cross-attention regularization technique that penalizes unrelated cross-attention scores in the RoI during the denoising steps, ensuring localized edits.
arXiv Detail & Related papers (2023-12-14T18:59:59Z) - DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing [94.24479528298252]
DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision.
By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images.
We present a challenging benchmark dataset called DragBench to evaluate the performance of interactive point-based image editing methods.
arXiv Detail & Related papers (2023-06-26T06:04:09Z) - Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion
Inference [36.73121523987844]
We introduce Fast Image Semantically Edit (FISEdit), a cache-enabled sparse diffusion model inference engine for efficient text-to-image editing.
FISEdit exploits the semantic mapping between minor modifications to the input text and the affected regions of the output image.
For each text editing step, FISEdit automatically identifies the affected image regions and reuses the cached feature maps of the unchanged regions to accelerate inference.
arXiv Detail & Related papers (2023-05-27T09:14:03Z) - DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited, obtained by contrasting the predictions of the diffusion model conditioned on different text prompts (see the sketch below).
arXiv Detail & Related papers (2022-10-20T17:16:37Z) - Blended Latent Diffusion [18.043090347648157]
- Blended Latent Diffusion [18.043090347648157]
We present an accelerated solution to the task of local text-driven editing of generic images, where the desired edits are confined to a user-provided mask.
Our solution leverages a recent text-to-image Latent Diffusion Model (LDM), which speeds up diffusion by operating in a lower-dimensional latent space. The per-step latent blending is sketched below.
arXiv Detail & Related papers (2022-06-06T17:58:04Z)