Related papers: Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance

Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance

URL: http://arxiv.org/abs/2401.02126v1
Date: Thu, 4 Jan 2024 08:21:30 GMT
Title: Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance
Authors: Jiacheng Wang, Ping Liu, Wei Xu
Abstract summary: We present a versatile image editing framework capable of executing both rigid and non-rigid edits. We leverage a dual-path injection scheme to handle diverse editing scenarios. We introduce an integrated self-attention mechanism for fusion of appearance and structural information.
Score: 15.130419159003816
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing text-to-image editing methods tend to excel either in rigid or non-rigid editing but encounter challenges when combining both, resulting in misaligned outputs with the provided text prompts. In addition, integrating reference images for control remains challenging. To address these issues, we present a versatile image editing framework capable of executing both rigid and non-rigid edits, guided by either textual prompts or reference images. We leverage a dual-path injection scheme to handle diverse editing scenarios and introduce an integrated self-attention mechanism for fusion of appearance and structural information. To mitigate potential visual artifacts, we further employ latent fusion techniques to adjust intermediate latents. Compared to previous work, our approach represents a significant advance in achieving precise and versatile image editing. Comprehensive experiments validate the efficacy of our method, showcasing competitive or superior results in text-based editing and appearance transfer tasks, encompassing both rigid and non-rigid settings.

Related papers

TDEdit: A Unified Diffusion Framework for Text-Drag Guided Image Manipulation [51.72432192816058]
We propose a unified diffusion-based framework for joint drag-text image editing.<n>Our framework introduces two key innovations: (1) Point-Cloud Deterministic Drag, which enhances latent-space layout control through 3D feature mapping, and (2) Drag-Text Guided Denoising, dynamically balancing the influence of drag and text conditions during denoising.
arXiv Detail & Related papers (2025-09-26T05:39:03Z)
Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions [20.617718631292696]
We develop a novel paradigm for instruction-driven image editing that leverages widely available and enormous text-image pairs.<n>Our approach introduces a multi-scale learnable region to localize and guide the editing process.<n>By treating the alignment between images and their textual descriptions as supervision and learning to generate task-specific editing regions, our method achieves high-fidelity, precise, and instruction-consistent image editing.
arXiv Detail & Related papers (2025-05-25T22:40:59Z)
DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics [71.78350994830885]
We present a novel approach to improving text-guided image editing using diffusion-based models.<n>Our method uses visual and textual self-attention to enhance the cross-attention map, which can serve as a regional cues to improve editing performance.<n>To fully compare our methods with other DiT-based approaches, we construct the RW-800 benchmark, featuring high resolution images, long descriptive texts, real-world images, and a new text editing task.
arXiv Detail & Related papers (2025-03-21T02:14:03Z)
UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency [69.33072075580483]
We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training. Our method addresses these challenges by introducing a novel editing mechanism called Cycle Edit Consistency ( CEC) CEC applies forward and backward edits in one training step and enforces consistency in image and attention spaces.
arXiv Detail & Related papers (2024-12-19T18:59:58Z)
Prompt Augmentation for Self-supervised Text-guided Image Manipulation [34.01939157351624]
We introduce prompt augmentation, a method amplifying a single input prompt into several target prompts, strengthening textual context and enabling localised image editing. We propose a Contrastive Loss tailored to driving effective image editing by displacing edited areas and drawing preserved regions closer. New losses are incorporated to the diffusion model, demonstrating improved or competitive image editing results on public datasets and generated images over state-of-the-art approaches.
arXiv Detail & Related papers (2024-12-17T16:54:05Z)
Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing [60.730661748555214]
We introduce textbfTask-textbfOriented textbfDiffusion textbfInversion (textbfTODInv), a novel framework that inverts and edits real images tailored to specific editing tasks. ToDInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability.
arXiv Detail & Related papers (2024-08-23T22:16:34Z)
A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users. Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models. T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance [13.535394339438428]
Diffusion-based image editing is a composite process of preserving the source image content and generating new content or applying modifications. We propose a zero-shot image editing method, named textbfEnhance textbfEditability for text-based image textbfEditing via textbfCLIP guidance.
arXiv Detail & Related papers (2024-03-15T09:26:48Z)
DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing [66.43179841884098]
Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years. We propose DiffEditor to rectify two weaknesses in existing diffusion-based image editing. Our method can efficiently achieve state-of-the-art performance on various fine-grained image editing tasks.
arXiv Detail & Related papers (2024-02-04T18:50:29Z)
Optimisation-Based Multi-Modal Semantic Image Editing [58.496064583110694]
We propose an inference-time editing optimisation to accommodate multiple editing instruction types. By allowing to adjust the influence of each loss function, we build a flexible editing solution that can be adjusted to user preferences. We evaluate our method using text, pose and scribble edit conditions, and highlight our ability to achieve complex edits.
arXiv Detail & Related papers (2023-11-28T15:31:11Z)
LayerDiffusion: Layered Controlled Image Editing with Diffusion Models [5.58892860792971]
LayerDiffusion is a semantic-based layered controlled image editing method. We leverage a large-scale text-to-image model and employ a layered controlled optimization strategy. Experimental results demonstrate the effectiveness of our method in generating highly coherent images.
arXiv Detail & Related papers (2023-05-30T01:26:41Z)
StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing [86.92711729969488]
We exploit the amazing capacities of pretrained diffusion models for the editing of images. They either finetune the model, or invert the image in the latent space of the pretrained model. They suffer from two problems: Unsatisfying results for selected regions, and unexpected changes in nonselected regions.
arXiv Detail & Related papers (2023-03-28T00:16:45Z)
Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting. We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process. Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.