Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference
- URL: http://arxiv.org/abs/2305.17423v3
- Date: Thu, 4 Jan 2024 08:10:13 GMT
- Title: Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference
- Authors: Zihao Yu, Haoyang Li, Fangcheng Fu, Xupeng Miao, Bin Cui
- Abstract summary: We introduce Fast Image Semantically Edit (FISEdit), a cache-enabled sparse diffusion model inference engine for efficient text-to-image editing.
FISEdit exploits the semantic mapping between minor modifications to the input text and the affected regions of the output image.
For each text editing step, FISEdit automatically identifies the affected image regions and reuses the cached feature maps of the unchanged regions to accelerate inference.
- Score: 36.73121523987844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the recent success of diffusion models, text-to-image generation is
becoming increasingly popular and finds a wide range of applications. Among them,
text-to-image editing, or continuous text-to-image generation, has attracted
considerable attention and can potentially improve the quality of generated images.
Users commonly want to slightly edit a generated image by making minor modifications
to the input textual description over several rounds of diffusion inference. However,
such an image editing process suffers from the low inference efficiency of many
existing diffusion models, even on GPU accelerators. To solve this problem, we
introduce Fast Image Semantically Edit (FISEdit), a cache-enabled sparse diffusion
model inference engine for efficient text-to-image editing. The key intuition behind
our approach is to exploit the semantic mapping between minor modifications to the
input text and the affected regions of the output image. For each text editing step,
FISEdit automatically identifies the affected image regions and reuses the cached
feature maps of the unchanged regions to accelerate the inference process.
Extensive empirical results show that FISEdit can be $3.4\times$ and
$4.4\times$ faster than existing methods on NVIDIA TITAN RTX and A100 GPUs
respectively, and even generates more satisfactory images.
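To make the cache-enabled sparse inference idea concrete, here is a minimal PyTorch-style sketch under stated assumptions: the threshold rule and the names `changed_mask`, `edit_with_cache`, and `eps_cache` are illustrative, not FISEdit's actual API, and the dense blend stands in for the sparse kernels that deliver the real speedup.

```python
import torch

def changed_mask(feat_prev, feat_curr, ratio=0.1):
    # Hypothetical detector: flag spatial positions whose intermediate
    # feature maps moved noticeably after the text edit.
    diff = (feat_curr - feat_prev).abs().mean(dim=1, keepdim=True)  # [B,1,H,W]
    return (diff > ratio * diff.amax()).float()

@torch.no_grad()
def edit_with_cache(unet, scheduler, x_T, text_emb_new, eps_cache, mask):
    # Denoising loop that splices cached per-step predictions (from the
    # previous prompt's run) into the unchanged regions. A real engine
    # would run sparse kernels that skip the unmasked computation
    # entirely; we blend densely here for clarity.
    x = x_T
    for t in scheduler.timesteps:
        eps_new = unet(x, t, encoder_hidden_states=text_emb_new).sample
        eps = mask * eps_new + (1 - mask) * eps_cache[int(t)]
        x = scheduler.step(eps, t, x).prev_sample
    return x
```

In this reading, the mask is computed once per text edit from an early denoising step, and all later steps reuse the cache outside it; the smaller the edited region, the larger the fraction of work that can be skipped.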
Related papers
- EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM [50.054404519821745]
We present a novel framework that integrates a multimodal Large Language Model to reason about and localize forged regions in diffusion-edited images.
Our framework achieves promising results on MagicBrush, AutoSplice, and PerfBrush datasets.
Notably, our method excels on the PerfBrush dataset, a self-constructed test set featuring previously unseen types of edits.
arXiv Detail & Related papers (2024-12-05T02:05:33Z)
- TurboEdit: Instant text-based image editing [32.06820085957286]
We address the challenges of precise image inversion and disentangled image editing in the context of few-step diffusion models.
We introduce an encoder-based iterative inversion technique. The inversion network is conditioned on the input image and the reconstruction from the previous step, allowing each step to correct the reconstruction toward the input image.
Our approach facilitates realistic text-guided image edits in real time, requiring only 8 function evaluations (NFEs) for inversion and 4 NFEs per edit.
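A minimal sketch of such an encoder-based iterative inversion loop, assuming a hypothetical inversion network `inv_net` and regeneration function `regenerate` (neither is the paper's interface):

```python
import torch

@torch.no_grad()
def iterative_invert(inv_net, regenerate, image, steps=8):
    # Hypothetical encoder-based inversion: each pass sees the target
    # image and the previous reconstruction, so the predicted noise is
    # progressively corrected toward an exact reconstruction.
    recon = torch.zeros_like(image)     # no reconstruction before step 1
    noise = None
    for _ in range(steps):              # ~8 NFEs, matching the paper
        noise = inv_net(image, recon)   # condition on (input, prev recon)
        recon = regenerate(noise)       # few-step decode from the noise
    return noise
```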
arXiv Detail & Related papers (2024-08-14T18:02:24Z)
- TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models [53.757752110493215]
We focus on a popular line of text-based editing frameworks: the "edit-friendly" DDPM-noise inversion approach.
We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength.
We propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts.
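The abstract does not spell out the pseudo-guidance formula; the sketch below shows one classifier-free-guidance-style extrapolation consistent with "increasing the magnitude of edits", with the function name and weighting purely hypothetical:

```python
def pseudo_guided_eps(eps_src, eps_edit, w=1.5):
    # Hypothetical extrapolation along the source->edit direction;
    # w = 1 recovers the plain edited prediction, w > 1 amplifies the
    # edit without drawing any additional samples.
    return eps_src + w * (eps_edit - eps_src)
```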
arXiv Detail & Related papers (2024-08-01T17:27:28Z)
- Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion [61.42732844499658]
This paper systematically improves text-guided image editing techniques based on diffusion models.
We incorporate human annotation as external knowledge to confine editing within a "mask-informed" region.
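Mask-confined editing is often realized by per-step latent blending, as in this minimal sketch; whether the paper's hybrid fusion works exactly this way is an assumption:

```python
def mask_informed_blend(lat_edit, lat_src, mask):
    # Hypothetical per-step fusion: keep the edited latent inside the
    # human-annotated mask and the source latent outside it, so the
    # edit cannot leak into unrelated regions.
    return mask * lat_edit + (1 - mask) * lat_src
```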
arXiv Detail & Related papers (2024-05-24T07:53:59Z)
- DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing [66.43179841884098]
Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years.
We propose DiffEditor to rectify two weaknesses in existing diffusion-based image editing.
Our method can efficiently achieve state-of-the-art performance on various fine-grained image editing tasks.
arXiv Detail & Related papers (2024-02-04T18:50:29Z)
- Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that preserves the content of the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
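Cross-attention guidance can be pictured as a gradient step that pulls the current attention maps toward reference maps recorded while reconstructing the input image; the interface below is a hypothetical simplification, not pix2pix-zero's code:

```python
import torch

def attn_guidance_step(x_t, attn_maps_fn, ref_maps, step_size=0.1):
    # Hypothetical guidance: penalize deviation of the current
    # cross-attention maps from the reference maps, then nudge the
    # latent down the gradient of that loss.
    x = x_t.detach().requires_grad_(True)
    loss = sum(((a - r) ** 2).mean()
               for a, r in zip(attn_maps_fn(x), ref_maps))
    (grad,) = torch.autograd.grad(loss, x)
    return (x_t - step_size * grad).detach()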
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts.
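A minimal sketch of that mask derivation, assuming a diffusers-style UNet interface; the re-noising scale, sample count, and threshold are illustrative choices:

```python
import torch

@torch.no_grad()
def diffedit_mask(unet, x_t, t, emb_query, emb_ref, n=10, thresh=0.5):
    # Hypothetical DiffEdit-style mask: average, over several noise
    # draws, the gap between noise estimates under the query and the
    # reference prompt, then threshold the normalized difference map.
    diffs = []
    for _ in range(n):
        noisy = x_t + 0.5 * torch.randn_like(x_t)   # illustrative re-noising
        e_q = unet(noisy, t, encoder_hidden_states=emb_query).sample
        e_r = unet(noisy, t, encoder_hidden_states=emb_ref).sample
        diffs.append((e_q - e_r).abs().mean(dim=1, keepdim=True))
    d = torch.stack(diffs).mean(dim=0)
    return (d / d.amax() > thresh).float()          # binary edit mask
```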
arXiv Detail & Related papers (2022-10-20T17:16:37Z)