Addressing Text Embedding Leakage in Diffusion-based Image Editing
- URL: http://arxiv.org/abs/2412.04715v4
- Date: Mon, 25 Aug 2025 06:53:46 GMT
- Title: Addressing Text Embedding Leakage in Diffusion-based Image Editing
- Authors: Sunung Mun, Jinhwan Nam, Sunghyun Cho, Jungseul Ok
- Abstract summary: We introduce Attribute-Leakage-free Editing (ALE), a framework that tackles attribute leakage at its source. ALE combines Object-Restricted Embeddings (ORE) to disentangle text embeddings, Region-Guided Blending for Cross-Attention Masking (RGB-CAM) for spatially precise attention, and Background Blending (BB) to preserve non-edited content.
- Score: 33.1686050396517
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based image editing, powered by generative diffusion models, lets users modify images through natural-language prompts and has dramatically simplified traditional workflows. Despite these advances, current methods still suffer from a critical problem: attribute leakage, where edits meant for specific objects unintentionally affect unrelated regions or other target objects. Our analysis reveals the root cause as the semantic entanglement inherent in End-of-Sequence (EOS) embeddings generated by autoregressive text encoders, which indiscriminately aggregate attributes across prompts. To address this issue, we introduce Attribute-Leakage-free Editing (ALE), a framework that tackles attribute leakage at its source. ALE combines Object-Restricted Embeddings (ORE) to disentangle text embeddings, Region-Guided Blending for Cross-Attention Masking (RGB-CAM) for spatially precise attention, and Background Blending (BB) to preserve non-edited content. To quantitatively evaluate attribute leakage across various editing methods, we propose the Attribute-Leakage Evaluation Benchmark (ALE-Bench), featuring comprehensive editing scenarios and new metrics. Extensive experiments show that ALE reduces attribute leakage by large margins, thereby enabling accurate, multi-object, text-driven image editing while faithfully preserving non-target content.
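The two spatial mechanisms the abstract names, cross-attention masking and background blending, can be illustrated with a minimal numpy sketch. The function names, shapes, and normalization here are assumptions for illustration, not ALE's actual implementation.

```python
import numpy as np

def masked_cross_attention(attn, token_masks):
    """Restrict each text token's attention to its object's spatial region.

    attn: (H*W, T) cross-attention weights from image positions to tokens.
    token_masks: (T, H*W) binary masks; token t may only attend within mask t.
    (Shapes are illustrative, not the paper's actual layout.)
    """
    gated = attn * token_masks.T                       # zero attention outside each mask
    gated /= gated.sum(axis=1, keepdims=True) + 1e-8   # renormalize per position
    return gated

def background_blend(edited, original, edit_mask):
    """Keep the original content everywhere outside the edit region."""
    return edit_mask * edited + (1.0 - edit_mask) * original
```

A masked position can then only draw from the tokens assigned to it, and the blend guarantees non-target pixels are copied from the source rather than regenerated.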
Related papers
- Training-Free Disentangled Text-Guided Image Editing via Sparse Latent Constraints [2.4140502941897544]
Text-driven image manipulation often suffers from attribute entanglement, where modifying a target attribute unintentionally alters other semantic properties such as identity or appearance.
The Predict, Prevent, and Evaluate framework addresses this issue by leveraging pre-trained vision-language models for disentangled editing.
Experimental results demonstrate that the proposed approach enforces more focused and controlled edits, effectively reducing unintended changes in non-target attributes while preserving facial identity.
arXiv Detail & Related papers (2025-12-25T11:38:10Z) - SAEdit: Token-level control for continuous image editing via Sparse AutoEncoder [52.754326452329956]
We introduce a method for disentangled and continuous editing through token-level manipulation of text embeddings.
The edits are applied by manipulating the embeddings along carefully chosen directions, which control the strength of the target attribute.
Our method operates directly on text embeddings without modifying the diffusion process, making it model agnostic and broadly applicable to various image backbones.
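The token-level, strength-controlled manipulation described above can be sketched as follows. The direction vector and helper name are hypothetical, standing in for the attribute directions SAEdit derives; only the mechanics of the continuous edit are shown.

```python
import numpy as np

def edit_token_embedding(embeddings, token_idx, direction, strength):
    """Shift a single token's embedding along an attribute direction.

    strength = 0 leaves the prompt embedding unchanged; increasing it
    pushes the target attribute harder, giving a continuous edit knob.
    """
    out = embeddings.copy()
    unit = direction / np.linalg.norm(direction)  # unit-length attribute direction
    out[token_idx] = out[token_idx] + strength * unit
    return out
```

Because only the text embedding changes, the diffusion model itself is untouched, which is what makes such a scheme model agnostic.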
arXiv Detail & Related papers (2025-10-06T17:51:04Z) - Immunizing Images from Text to Image Editing via Adversarial Cross-Attention [17.498230426195114]
We propose a novel attack that targets the visual component of editing methods.
We introduce Attention Attack, which disrupts the cross-attention between a textual prompt and the visual representation of the image.
Experiments conducted on the TEDBench++ benchmark demonstrate that our attack significantly degrades editing performance while remaining imperceptible.
arXiv Detail & Related papers (2025-09-12T15:47:50Z) - CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing [24.68304617869157]
Context-Preserving Adaptive Manipulation (CPAM) is a novel framework for complicated, non-rigid real image editing.
We develop a preservation adaptation module that adjusts self-attention mechanisms to preserve and independently control the object and background effectively.
We also introduce various mask-guidance strategies to facilitate diverse image manipulation tasks in a simple manner.
arXiv Detail & Related papers (2025-06-23T09:19:38Z) - MDE-Edit: Masked Dual-Editing for Multi-Object Image Editing via Diffusion Models [10.798205956644317]
We propose a training-free, inference-stage optimization approach that enables precise localized image manipulation in complex multi-object scenes, named MDE-Edit.
Extensive experiments demonstrate that MDE-Edit outperforms state-of-the-art methods in editing accuracy and visual quality, offering a robust solution for complex multi-object image manipulation tasks.
arXiv Detail & Related papers (2025-05-08T10:01:14Z) - LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing [6.057289837472806]
Text-guided image editing aims to modify specific regions of an image according to natural language instructions.
Since cross-attention mechanisms focus on semantic relevance, they struggle to maintain image integrity.
We introduce LOCATEdit, which enhances cross-attention maps through a graph-based approach.
arXiv Detail & Related papers (2025-03-27T14:32:17Z) - Lost in Edits? A $λ$-Compass for AIGC Provenance [119.95562081325552]
We propose a novel latent-space attribution method that robustly identifies and differentiates authentic outputs from manipulated ones.
LambdaTracer is effective across diverse iterative editing processes, whether automated through text-guided editing tools such as InstructPix2Pix or performed manually with editing software such as Adobe Photoshop.
arXiv Detail & Related papers (2025-02-05T06:24:25Z) - CA-Edit: Causality-Aware Condition Adapter for High-Fidelity Local Facial Attribute Editing [41.92598830147057]
A novel data utilization strategy is introduced to construct datasets consisting of attribute-text triples from a data-driven perspective.
A Skin Transition Frequency Guidance technique is introduced for the local modeling of contextual causality.
arXiv Detail & Related papers (2024-12-18T07:33:22Z) - DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting [63.01425442236011]
We present DreamMix, a diffusion-based generative model adept at inserting target objects into scenes at user-specified locations.
We propose an Attribute Decoupling Mechanism (ADM) and a Textual Attribute Substitution (TAS) module to improve the diversity and discriminative capability of the text-based attribute guidance.
arXiv Detail & Related papers (2024-11-26T08:44:47Z) - DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
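The interpolation schedule described above can be given as a hedged sketch; the warmup fraction and linear blend are assumptions, and the paper's actual schedule may differ.

```python
import numpy as np

def interpolate_attention(src_feat, tgt_feat, step, n_steps, warmup=0.3):
    """Blend source and target attention features early in denoising.

    During the first `warmup` fraction of steps the output ramps from the
    source features toward the target, then stays fully on the target,
    smoothly fusing the new layout with the original appearance.
    """
    cutoff = warmup * n_steps
    alpha = min(step / cutoff, 1.0) if cutoff > 0 else 1.0
    return alpha * tgt_feat + (1.0 - alpha) * src_feat
```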
arXiv Detail & Related papers (2024-06-03T17:59:53Z) - FlexEdit: Flexible and Controllable Diffusion-based Object-centric Image Editing [3.852667054327356]
We introduce FlexEdit, a flexible and controllable editing framework for objects.
We iteratively adjust latents at each denoising step using our FlexEdit block.
Our framework employs an adaptive mask, automatically extracted during denoising, to protect the background.
arXiv Detail & Related papers (2024-03-27T14:24:30Z) - EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing [62.15822650722473]
Current video editing methods fail to edit the foreground and background simultaneously while preserving the original layout.
We introduce EVA, a zero-shot and multi-attribute video editing framework tailored for human-centric videos with complex motions.
EVA can be easily generalized to multi-object editing scenarios and achieves accurate identity mapping.
arXiv Detail & Related papers (2024-03-24T12:04:06Z) - LoMOE: Localized Multi-Object Editing via Multi-Diffusion [8.90467024388923]
We introduce a novel framework for zero-shot localized multi-object editing through a multi-diffusion process.
Our approach leverages foreground masks and corresponding simple text prompts that exert localized influences on the target regions.
A combination of cross-attention and background losses within the latent space ensures that the characteristics of the object being edited are preserved.
arXiv Detail & Related papers (2024-03-01T10:46:47Z) - LIME: Localized Image Editing via Attention Regularization in Diffusion Models [69.33072075580483]
This paper introduces LIME for localized image editing in diffusion models.
LIME does not require user-specified regions of interest (RoI) or additional text input, but rather employs features from pre-trained methods and a straightforward clustering method to obtain a precise editing mask.
We propose a novel cross-attention regularization technique that penalizes unrelated cross-attention scores in the RoI during the denoising steps, ensuring localized edits.
arXiv Detail & Related papers (2023-12-14T18:59:59Z) - iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity, CLIP alignment score and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z) - Regularizing Self-training for Unsupervised Domain Adaptation via Structural Constraints [14.593782939242121]
We propose to incorporate structural cues from auxiliary modalities, such as depth, to regularise conventional self-training objectives.
Specifically, we introduce a contrastive pixel-level objectness constraint that pulls the pixel representations within a region of an object instance closer.
We show that our regularizer significantly improves top performing self-training methods in various UDA benchmarks for semantic segmentation.
arXiv Detail & Related papers (2023-04-29T00:12:26Z) - StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing [115.49488548588305]
A significant research effort is focused on exploiting the amazing capacities of pretrained diffusion models for the editing of images.
They either finetune the model, or invert the image in the latent space of the pretrained model.
They suffer from two problems: Unsatisfying results for selected regions and unexpected changes in non-selected regions.
arXiv Detail & Related papers (2023-03-28T00:16:45Z) - Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting [53.708523312636096]
We present Imagen Editor, a cascaded diffusion model built by fine-tuning for text-guided image inpainting.
Edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training.
To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting.
arXiv Detail & Related papers (2022-12-13T21:25:11Z) - DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited.
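The core idea, deriving the mask from where noise estimates under the source and query prompts disagree, can be sketched as follows; the array shapes and the threshold are illustrative, not DiffEdit's exact procedure.

```python
import numpy as np

def diffedit_mask(noise_src, noise_query, threshold=0.5):
    """Estimate an edit mask from the disagreement between two noise
    predictions: one conditioned on the source caption, one on the
    query (edit) caption. Regions where they differ are likely to
    need editing; the rest can be preserved.
    """
    diff = np.abs(noise_src - noise_query).mean(axis=0)  # average over channels
    diff = diff / (diff.max() + 1e-8)                    # normalize to [0, 1]
    return (diff > threshold).astype(np.float32)
```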
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.