Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model
- URL: http://arxiv.org/abs/2504.05594v1
- Date: Tue, 08 Apr 2025 01:02:50 GMT
- Title: Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model
- Authors: Qi Mao, Lan Chen, Yuchao Gu, Mike Zheng Shou, Ming-Hsuan Yang
- Abstract summary: We introduce UnifyEdit, a tuning-free method that performs diffusion latent optimization. We develop two attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity, and a cross-attention (CA) alignment constraint to enhance text alignment. Our approach achieves a robust balance between structure preservation and text alignment across various editing tasks, outperforming other state-of-the-art methods.
- Score: 60.82962950960996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Balancing fidelity and editability is essential in text-based image editing (TIE), where failures commonly lead to over- or under-editing issues. Existing methods typically rely on attention injections for structure preservation and leverage the inherent text alignment capabilities of pre-trained text-to-image (T2I) models for editability, but they lack explicit and unified mechanisms to properly balance these two objectives. In this work, we introduce UnifyEdit, a tuning-free method that performs diffusion latent optimization to enable a balanced integration of fidelity and editability within a unified framework. Unlike direct attention injections, we develop two attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity, and a cross-attention (CA) alignment constraint to enhance text alignment for improved editability. However, simultaneously applying both constraints can lead to gradient conflicts, where the dominance of one constraint results in over- or under-editing. To address this challenge, we introduce an adaptive time-step scheduler that dynamically adjusts the influence of these constraints, guiding the diffusion latent toward an optimal balance. Extensive quantitative and qualitative experiments validate the effectiveness of our approach, demonstrating its superiority in achieving a robust balance between structure preservation and text alignment across various editing tasks, outperforming other state-of-the-art methods. The source code will be available at https://github.com/CUC-MIPG/UnifyEdit.
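The balancing idea in the abstract can be sketched in a few lines: the diffusion latent is updated by gradients of two constraint losses whose relative weights follow a time-step schedule. The sketch below is purely illustrative, using toy quadratic stand-ins for the SA and CA constraints and a hypothetical linear schedule; the paper's actual constraints operate on attention maps, and its scheduler is adaptive rather than fixed.

```python
import numpy as np

def scheduled_latent_step(z, grad_fid, grad_edit, t, T, lr=0.1):
    """One optimization step on the latent z.

    grad_fid / grad_edit: gradients of a fidelity (SA-style) and an
    editability (CA-style) constraint -- toy stand-ins here.
    Hypothetical linear schedule: early (large t) steps favor
    editability, late steps favor fidelity.
    """
    w_edit = t / T
    w_fid = 1.0 - w_edit
    return z - lr * (w_fid * grad_fid(z) + w_edit * grad_edit(z))

# Toy demo: the fidelity term pulls the latent toward the source,
# the editability term toward a text-aligned target; quadratic
# losses give linear gradients.
z_src = np.zeros(4)            # stand-in for the source latent
z_tgt = np.ones(4)             # stand-in for the text-aligned latent
grad_fid = lambda z: z - z_src
grad_edit = lambda z: z - z_tgt

T = 50
z = z_src.copy()
for t in range(T, 0, -1):      # denoising runs from t = T down to 1
    z = scheduled_latent_step(z, grad_fid, grad_edit, t, T)

print(z)  # the latent ends strictly between source and target
```

Each step is a convex pull toward a schedule-dependent mixture of the two objectives, so the latent settles between the source and the edit target rather than collapsing to either extreme, which is the over-/under-editing failure mode the abstract describes.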
Related papers
- LORE: Latent Optimization for Precise Semantic Control in Rectified Flow-based Image Editing [0.276240219662896]
We introduce LORE, a training-free and efficient image editing method. LORE directly optimizes the inverted noise, addressing the core limitations in generalization and controllability of existing approaches. Experimental results show that LORE significantly outperforms strong baselines in terms of semantic alignment, image quality, and background fidelity.
arXiv Detail & Related papers (2025-08-05T06:45:04Z) - Stable Score Distillation [45.48460025487433]
We introduce Stable Score Distillation (SSD), a streamlined framework that enhances stability and alignment in the editing process. Our method achieves state-of-the-art results in 2D and 3D editing tasks, including NeRF and text-driven style edits, with faster convergence and reduced complexity.
arXiv Detail & Related papers (2025-07-12T07:14:00Z) - Improving Diffusion-Based Image Editing Faithfulness via Guidance and Scheduling [1.8876415010297893]
In image editing, two crucial aspects are editability, which determines the extent of modification, and faithfulness, which reflects how well unaltered elements are preserved. We propose Faithfulness Guidance and Scheduling (FGS), which enhances faithfulness with minimal impact on editability. Experimental results demonstrate that FGS achieves superior faithfulness while maintaining editability.
arXiv Detail & Related papers (2025-06-26T06:46:03Z) - Control and Realism: Best of Both Worlds in Layout-to-Image without Training [59.16447569868382]
We present WinWinLay, a training-free method for layout-to-image generation. We propose two key strategies, Non-local Attention Energy and Adaptive Update, that collaboratively enhance control precision and realism. WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming the current state-of-the-art methods.
arXiv Detail & Related papers (2025-06-18T15:39:02Z) - DisProtEdit: Exploring Disentangled Representations for Multi-Attribute Protein Editing [48.819599672346136]
DisProtEdit is a controllable protein editing framework that leverages dual-channel natural language supervision. DisProtEdit explicitly separates semantic factors, enabling modular and interpretable control.
arXiv Detail & Related papers (2025-06-17T06:12:18Z) - Training-Free Text-Guided Image Editing with Visual Autoregressive Model [46.201510044410995]
We propose a novel text-guided image editing framework based on Visual AutoRegressive modeling. Our method eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our framework operates in a training-free manner and achieves high-fidelity editing with faster inference speeds.
arXiv Detail & Related papers (2025-03-31T09:46:56Z) - DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics [71.78350994830885]
We present a novel approach to improving text-guided image editing using diffusion-based models. Our method uses visual and textual self-attention to enhance the cross-attention map, which can serve as regional cues to improve editing performance. To fully compare our method with other DiT-based approaches, we construct the RW-800 benchmark, featuring high-resolution images, long descriptive texts, real-world images, and a new text editing task.
arXiv Detail & Related papers (2025-03-21T02:14:03Z) - UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency [69.33072075580483]
We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training. Our method addresses these challenges by introducing a novel editing mechanism called Cycle Edit Consistency (CEC). CEC applies forward and backward edits in one training step and enforces consistency in image and attention spaces.
arXiv Detail & Related papers (2024-12-19T18:59:58Z) - Uniform Attention Maps: Boosting Image Fidelity in Reconstruction and Editing [66.48853049746123]
We analyze reconstruction from a structural perspective and propose a novel approach that replaces traditional cross-attention with uniform attention maps. Our method effectively minimizes distortions caused by varying text conditions during noise prediction. Experimental results demonstrate that our approach not only excels in achieving high-fidelity image reconstruction but also performs robustly in real image composition and editing scenarios.
arXiv Detail & Related papers (2024-11-29T12:11:28Z) - Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory [33.78620829249978]
Text-to-image (T2I) diffusion models have revolutionized generative modeling by producing high-fidelity, diverse, and visually realistic images.
Recent attention-based methods have improved object inclusion and linguistic binding, but still face challenges such as attribute misbinding.
We propose a Bayesian approach that designs custom priors over attention distributions to enforce desirable properties.
Our approach treats the attention mechanism as an interpretable component, enabling fine-grained control and improved attribute-object alignment.
arXiv Detail & Related papers (2024-11-25T10:57:48Z) - Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing [60.730661748555214]
We introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks.
TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability.
arXiv Detail & Related papers (2024-08-23T22:16:34Z) - E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance [13.535394339438428]
Diffusion-based image editing is a composite process of preserving the source image content and generating new content or applying modifications.
We propose a zero-shot image editing method, named Enhance Editability for text-based image Editing via CLIP guidance (E4C).
arXiv Detail & Related papers (2024-03-15T09:26:48Z) - Doubly Abductive Counterfactual Inference for Text-based Image Editing [130.46583155383735]
We study text-based image editing (TBIE) of a single image by counterfactual inference.
We propose a Doubly Abductive Counterfactual inference framework (DAC).
Our DAC achieves a good trade-off between editability and fidelity.
arXiv Detail & Related papers (2024-03-05T13:59:21Z) - BARET: Balanced Attention based Real image Editing driven by Target-text Inversion [36.59406959595952]
We propose a novel editing technique that only requires an input image and target text for various editing types including non-rigid edits without fine-tuning diffusion model.
Our method contains three novelties: (I) Target-text Inversion Schedule (TTIS) is designed to fine-tune the input target text embedding to achieve fast image reconstruction without an image caption and to accelerate convergence; (II) Progressive Transition Scheme applies a progressive linear approach between the target text embedding and its fine-tuned version to generate a transition embedding, maintaining non-rigid editing capability; (III) Balanced Attention Module (BAM) balances the trade-off between textual description and image semantics.
arXiv Detail & Related papers (2023-12-09T07:18:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.