BARET : Balanced Attention based Real image Editing driven by
Target-text Inversion
- URL: http://arxiv.org/abs/2312.05482v1
- Date: Sat, 9 Dec 2023 07:18:23 GMT
- Title: BARET : Balanced Attention based Real image Editing driven by
Target-text Inversion
- Authors: Yuming Qiao, Fanyi Wang, Jingwen Su, Yanhao Zhang, Yunjie Yu, Siyu Wu,
Guo-Jun Qi
- Abstract summary: We propose a novel editing technique that requires only an input image and a target text for various editing types, including non-rigid edits, without fine-tuning the diffusion model.
Our method contains three novelties: (I) the Target-text Inversion Schedule (TTIS) is designed to fine-tune the input target text embedding, achieving fast image reconstruction and accelerated convergence without an image caption; (II) the Progressive Transition Scheme applies progressive linear interpolation between the target text embedding and its fine-tuned version to generate transition embeddings that maintain non-rigid editing capability; (III) the Balanced Attention Module (BAM) balances the tradeoff between the textual description and image semantics.
- Score: 36.59406959595952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image editing approaches with diffusion models have been rapidly developed,
yet their applicability is subject to requirements such as specific editing
types (e.g., foreground or background object editing, style transfer), multiple
conditions (e.g., mask, sketch, caption), and time-consuming fine-tuning of
diffusion models. To alleviate these limitations and realize efficient
real image editing, we propose a novel editing technique that requires only an
input image and a target text for various editing types, including non-rigid edits,
without fine-tuning the diffusion model. Our method contains three novelties: (I)
the Target-text Inversion Schedule (TTIS) is designed to fine-tune the input target
text embedding, achieving fast image reconstruction and accelerated convergence
without an image caption; (II) the Progressive Transition Scheme applies
progressive linear interpolation between the target text embedding and its
fine-tuned version to generate transition embeddings that maintain non-rigid
editing capability; (III) the Balanced Attention Module (BAM) balances the tradeoff
between the textual description and image semantics. By combining the
self-attention map from the reconstruction process with the cross-attention map from
the transition process, BAM optimizes the guidance of the target text embedding in
the diffusion process. To demonstrate the editing capability, effectiveness, and
efficiency of the proposed BARET, we conducted extensive qualitative and
quantitative experiments. Moreover, results from a user study and an
ablation study further demonstrate its superiority over other methods.
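The abstract outlines enough structure to sketch the pipeline's shape. Below is a minimal, hypothetical PyTorch sketch of the three components, assuming a Stable-Diffusion-style backbone where `unet(z_t, t, emb)` predicts noise and `scheduler` follows the diffusers API; the function names, hyperparameters, and the BAM combination rule are illustrative assumptions, not the paper's implementation.

```python
import torch

def ttis(unet, scheduler, z0, e_tgt, n_steps=500, lr=1e-3):
    """(I) Target-text Inversion Schedule: fine-tune the target text embedding
    (no image caption needed) so the frozen diffusion model reconstructs the
    input latent z0. Step count and learning rate are placeholders."""
    e_ft = e_tgt.clone().requires_grad_(True)
    opt = torch.optim.Adam([e_ft], lr=lr)
    for _ in range(n_steps):
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device=z0.device)
        noise = torch.randn_like(z0)
        z_t = scheduler.add_noise(z0, noise, t)
        loss = torch.nn.functional.mse_loss(unet(z_t, t, e_ft), noise)
        opt.zero_grad(); loss.backward(); opt.step()
    return e_ft.detach()

def transition_embedding(e_tgt, e_ft, alpha):
    """(II) Progressive Transition Scheme: linear interpolation between the
    fine-tuned embedding (faithful to the image) and the raw target embedding
    (faithful to the edit); alpha grows from 0 to 1 over denoising steps."""
    return (1.0 - alpha) * e_ft + alpha * e_tgt

def balanced_attention(rec_attn, tr_attn):
    """(III) Balanced Attention Module: combine self-attention maps recorded
    on the reconstruction path (image semantics) with cross-attention maps
    from the transition path (target text). This simple injection stands in
    for the paper's balancing rule, which the abstract does not specify."""
    return {"self": rec_attn["self"], "cross": tr_attn["cross"]}
```

In use, the reconstruction pass would run the denoiser with `e_ft` while recording self-attention maps, and the editing pass would run with `transition_embedding(...)` while injecting the maps returned by `balanced_attention`.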
Related papers
- Vision-guided and Mask-enhanced Adaptive Denoising for Prompt-based Image Editing [28.904419606450876]
We present a Vision-guided and Mask-enhanced Adaptive Editing (ViMAEdit) method with three key novel designs.
First, we propose to leverage image embeddings as explicit guidance to enhance the conventional textual prompt-based denoising process.
Second, we devise a self-attention-guided iterative editing area grounding strategy.
arXiv Detail & Related papers (2024-10-14T13:41:37Z)
- Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing [60.730661748555214]
We introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks.
TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability.
arXiv Detail & Related papers (2024-08-23T22:16:34Z)
- TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models [53.757752110493215]
We focus on a popular line of text-based editing frameworks: the "edit-friendly" DDPM-noise inversion approach.
We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength.
We propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts.
arXiv Detail & Related papers (2024-08-01T17:27:28Z)
- Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion [61.42732844499658]
This paper systematically improves text-guided image editing techniques based on diffusion models.
We incorporate human annotation as external knowledge to confine editing within a "Mask-informed" region.
arXiv Detail & Related papers (2024-05-24T07:53:59Z)
- Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing [2.5602836891933074]
A commonly adopted strategy for editing real images involves inverting the diffusion process to obtain a noisy representation of the original image.
Current methods for diffusion inversion often struggle to produce edits that are both faithful to the specified text prompt and closely resemble the source image.
We introduce a novel and adaptable diffusion inversion technique for real image editing, grounded in a theoretical analysis of the role of $\eta$ in the DDIM sampling equation for enhanced editability (the standard $\eta$-weighted DDIM update is sketched after this list).
arXiv Detail & Related papers (2024-03-14T15:07:36Z)
- LIME: Localized Image Editing via Attention Regularization in Diffusion Models [74.3811832586391]
This paper introduces LIME, a method for localized image editing in diffusion models that requires neither user-specified regions of interest (RoI) nor additional text input.
Our method employs features from pre-trained methods and a simple clustering technique to obtain precise semantic segmentation maps.
We propose a novel cross-attention regularization technique that penalizes unrelated cross-attention scores in the RoI during the denoising steps, ensuring localized edits.
arXiv Detail & Related papers (2023-12-14T18:59:59Z)
- Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code [19.988947272980848]
"Direct Inversion" is a novel technique achieving optimal performance of both branches with just three lines of code.
We present PIE-Bench, an editing benchmark with 700 images showcasing diverse scenes and editing types.
Compared to state-of-the-art optimization-based inversion techniques, our solution not only yields superior performance across 8 editing methods but also achieves nearly an order-of-magnitude speed-up.
arXiv Detail & Related papers (2023-10-02T18:01:55Z)
- Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models [6.34777393532937]
We propose an accurate and quick inversion technique, Prompt Tuning Inversion, for text-driven image editing.
Our proposed editing method consists of a reconstruction stage and an editing stage.
Experiments on ImageNet demonstrate the superior editing performance of our method compared to the state-of-the-art baselines.
arXiv Detail & Related papers (2023-05-08T03:34:33Z)
- StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing [86.92711729969488]
We exploit the capabilities of pretrained diffusion models for image editing.
Existing methods either fine-tune the model or invert the image in the latent space of the pretrained model.
Both approaches suffer from two problems: unsatisfying results in selected regions and unexpected changes in non-selected regions.
arXiv Detail & Related papers (2023-03-28T00:16:45Z)
- Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process (a minimal sketch of this idea appears after this list).
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
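For context on the Eta Inversion entry above: in DDIM sampling (Song et al., 2021), $\eta$ scales the stochastic term of the update. That paper designs an optimal $\eta$ function; the standard constant-$\eta$ update it builds on is:

```latex
\sigma_t(\eta) = \eta\,\sqrt{\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}}
                 \sqrt{1-\frac{\bar\alpha_t}{\bar\alpha_{t-1}}},
\qquad
x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,
          \frac{x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}
        + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^{2}}\,\epsilon_\theta(x_t,t)
        + \sigma_t\,\epsilon_t,
\quad \epsilon_t\sim\mathcal{N}(0,I).
```

Setting $\eta = 0$ gives the deterministic DDIM update commonly used for inversion, while $\eta = 1$ matches DDPM-level stochasticity, which that entry argues can enhance editability.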
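Similarly, the cross-attention guidance in the Zero-shot Image-to-Image Translation entry can be sketched: at each denoising step, the latent is nudged by a gradient step so that its cross-attention maps stay close to reference maps recorded from the input image. The helper `get_cross_attn` and the guidance weight `lam` below are illustrative assumptions, not the paper's API.

```python
import torch

def xattn_guidance_step(z_t, t, emb_edit, unet, ref_maps, get_cross_attn, lam=0.1):
    # Forward pass with the editing text embedding; attention processors are
    # assumed to record cross-attention maps retrievable via get_cross_attn.
    z_t = z_t.detach().requires_grad_(True)
    _ = unet(z_t, t, emb_edit)
    cur_maps = get_cross_attn(unet)
    # Penalize deviation from the reference maps of the input image.
    loss = sum(torch.nn.functional.mse_loss(c, r)
               for c, r in zip(cur_maps, ref_maps))
    grad, = torch.autograd.grad(loss, z_t)
    # Nudge the latent toward attention maps that match the input's structure.
    return (z_t - lam * grad).detach()
```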
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.