Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion
Inference
- URL: http://arxiv.org/abs/2305.17423v3
- Date: Thu, 4 Jan 2024 08:10:13 GMT
- Title: Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion
Inference
- Authors: Zihao Yu, Haoyang Li, Fangcheng Fu, Xupeng Miao, Bin Cui
- Abstract summary: We introduce Fast Image Semantically Edit (FISEdit), a cache-enabled sparse diffusion model inference engine for efficient text-to-image editing.
FISEdit exploits the semantic mapping between minor modifications to the input text and the affected regions of the output image.
For each text editing step, FISEdit can automatically identify the affected image regions and utilize the cached unchanged regions' feature map to accelerate the inference process.
- Score: 36.73121523987844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the recent success of diffusion models, text-to-image generation is
becoming increasingly popular and achieves a wide range of applications. Among
them, text-to-image editing, or continuous text-to-image generation, attracts
lots of attention and can potentially improve the quality of generated images.
It's common to see that users may want to slightly edit the generated image by
making minor modifications to their input textual descriptions for several
rounds of diffusion inference. However, such an image editing process suffers
from the low inference efficiency of many existing diffusion models, even when
using GPU accelerators. To solve this problem, we introduce Fast Image Semantically
Edit (FISEdit), a cache-enabled sparse diffusion model inference engine for
efficient text-to-image editing. The key intuition behind our approach is to
utilize the semantic mapping between the minor modifications on the input text
and the affected regions on the output image. For each text editing step,
FISEdit can automatically identify the affected image regions and utilize the
cached unchanged regions' feature map to accelerate the inference process.
Extensive empirical results show that FISEdit can be $3.4\times$ and
$4.4\times$ faster than existing methods on NVIDIA TITAN RTX and A100 GPUs
respectively, and even generates more satisfactory images.
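
The caching idea in the abstract can be illustrated with a toy sketch. This is not FISEdit's implementation: the abstract says the affected regions are identified from the text modification, whereas this sketch assumes, purely for illustration, that a mask is obtained by comparing the old and new inputs. All names (`changed_mask`, `sparse_update`, `toy_layer`) are hypothetical, and the "layer" is a pointwise stand-in for an expensive per-location computation.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]
Mask = List[List[bool]]


def changed_mask(old: Grid, new: Grid, threshold: float = 1e-3) -> Mask:
    """Mark cells whose input differs by more than `threshold` (assumed criterion)."""
    return [
        [abs(a - b) > threshold for a, b in zip(old_row, new_row)]
        for old_row, new_row in zip(old, new)
    ]


def sparse_update(
    new_input: Grid,
    cached_output: Grid,
    mask: Mask,
    layer: Callable[[float], float],
) -> Tuple[Grid, int]:
    """Recompute `layer` only where `mask` is True; reuse the cache elsewhere.

    Returns the updated output and the number of cells that were recomputed.
    """
    recomputed = 0
    output: Grid = []
    for in_row, cache_row, mask_row in zip(new_input, cached_output, mask):
        out_row = []
        for x, cached, dirty in zip(in_row, cache_row, mask_row):
            if dirty:
                out_row.append(layer(x))
                recomputed += 1
            else:
                out_row.append(cached)
        output.append(out_row)
    return output, recomputed


def toy_layer(x: float) -> float:
    """Hypothetical stand-in for an expensive, purely pointwise computation."""
    return 2.0 * x + 1.0


# First round: dense computation over the whole grid; the result is cached.
old_input = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
cache = [[toy_layer(x) for x in row] for row in old_input]

# Second round: a small edit changes a single location.
new_input = [[0.0, 1.0, 2.0], [3.0, 9.0, 5.0]]
mask = changed_mask(old_input, new_input)
output, recomputed = sparse_update(new_input, cache, mask, toy_layer)

# Reference: recomputing everything from scratch gives the same result.
dense = [[toy_layer(x) for x in row] for row in new_input]
```

In this toy case only one of six cells is recomputed, and the result matches the dense recomputation. Real diffusion layers such as convolutions and attention mix information across locations, so an actual engine has to handle region boundaries; the sketch deliberately ignores that.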
Related papers
- Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion [61.42732844499658]
This paper systematically improves the text-guided image editing techniques based on diffusion models.
We incorporate human annotation as external knowledge to confine editing within a "Mask-informed" region.
arXiv Detail & Related papers (2024-05-24T07:53:59Z)
- Editable Image Elements for Controllable Synthesis [79.58148778509769]
We propose an image representation that promotes spatial editing of input images using a diffusion model.
We show the effectiveness of our representation on various image editing tasks, such as object resizing, rearrangement, dragging, de-occlusion, removal, variation, and image composition.
arXiv Detail & Related papers (2024-04-24T17:59:11Z)
- DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing [66.43179841884098]
Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years.
We propose DiffEditor to rectify two weaknesses in existing diffusion-based image editing.
Our method can efficiently achieve state-of-the-art performance on various fine-grained image editing tasks.
arXiv Detail & Related papers (2024-02-04T18:50:29Z)
- Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models [6.34777393532937]
We propose an accurate and quick inversion technique, Prompt Tuning Inversion, for text-driven image editing.
Our proposed editing method consists of a reconstruction stage and an editing stage.
Experiments on ImageNet demonstrate the superior editing performance of our method compared to the state-of-the-art baselines.
arXiv Detail & Related papers (2023-05-08T03:34:33Z)
- Towards Real-time Text-driven Image Manipulation with Unconditional Diffusion Models [33.993466872389085]
We develop a novel algorithm that learns image manipulations 4.5-10 times faster and applies them 8 times faster.
Our approach can adapt the pretrained model to the user-specified image and text description on the fly just for 4 seconds.
arXiv Detail & Related papers (2023-04-10T01:21:56Z)
- Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting regions of the input image that need to be edited.
arXiv Detail & Related papers (2022-10-20T17:16:37Z)