ConsistEdit: Highly Consistent and Precise Training-free Visual Editing
- URL: http://arxiv.org/abs/2510.17803v1
- Date: Mon, 20 Oct 2025 17:59:52 GMT
- Title: ConsistEdit: Highly Consistent and Precise Training-free Visual Editing
- Authors: Zixin Yin, Ling-Hao Chen, Lionel Ni, Xili Dai
- Abstract summary: We propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT. It incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens. It achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing generation models. However, current approaches struggle to deliver strong editing strength while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating text and vision modalities. These advancements pave the way for overcoming challenges that previous methods failed to resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT. ConsistEdit incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without handcrafted adjustments, significantly enhancing reliability and consistency, which enables robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer control.
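The three ingredients named in the abstract lend themselves to a compact sketch. Below is a minimal, hypothetical illustration of vision-only attention control, mask-guided pre-attention fusion, and differentiated Q/K/V manipulation inside one joint-attention layer; the tensor layout, the `consistency` knob, and all names are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def controlled_joint_attention(q_edit, k_edit, v_edit,
                               q_src, k_src, v_src,
                               n_text, edit_mask, consistency=1.0):
    """Sketch of vision-only attention control in a joint (MM-DiT-style)
    attention layer. Tensors are (batch, heads, n_text + n_vision, dim);
    `edit_mask` is (n_vision,) with 1 where the edit should apply.

    Text tokens are left untouched; only vision tokens are manipulated,
    and Q, K, V are each treated differently.
    """
    vis = slice(n_text, None)                           # vision-token positions
    keep = (1.0 - edit_mask.float()).view(1, 1, -1, 1)  # regions to preserve

    q, k, v = q_edit.clone(), k_edit.clone(), v_edit.clone()
    # Differentiated Q/K manipulation: blend in the source pass's queries
    # and keys on vision tokens only, scaled by a structural-consistency
    # knob in [0, 1] (mirrors the "progressive adjustment" in the abstract).
    q[:, :, vis] = consistency * q_src[:, :, vis] + (1 - consistency) * q_edit[:, :, vis]
    k[:, :, vis] = consistency * k_src[:, :, vis] + (1 - consistency) * k_edit[:, :, vis]
    # Mask-guided pre-attention fusion: outside the edit mask, fall back to
    # source values so unedited regions stay consistent across rounds.
    v[:, :, vis] = keep * v_src[:, :, vis] + (1 - keep) * v_edit[:, :, vis]

    return F.scaled_dot_product_attention(q, k, v)
```

Because the fusion happens before attention and touches only vision tokens, the text stream still steers the edit region through the unmodified cross-modal attention, which is one plausible reading of why the method can run at every step and layer without per-case tuning.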
Related papers
- O-DisCo-Edit: Object Distortion Control for Unified Realistic Video Editing [88.93410369258203]
O-DisCo-Edit is a unified framework that incorporates a novel object distortion control (O-DisCo) signal. This signal, based on random and adaptive noise, flexibly encapsulates a wide range of editing cues within a single representation. O-DisCo-Edit enables efficient, high-fidelity editing through an effective training paradigm.
arXiv Detail & Related papers (2025-09-01T16:29:39Z)
- CoreEditor: Consistent 3D Editing via Correspondence-constrained Diffusion [24.144486805878596]
CoreEditor is a novel framework for consistent text-to-3D editing. We introduce a correspondence-constrained attention mechanism that enforces precise interactions between corresponding pixels. In experiments, CoreEditor produces high-quality, 3D-consistent edits with sharper details.
arXiv Detail & Related papers (2025-08-15T17:13:11Z)
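As a rough illustration of what a correspondence constraint on attention can look like (a plausible form, not CoreEditor's actual implementation), one can add a large negative bias to attention logits between non-corresponding pixel pairs, assuming the correspondence matrix is precomputed from geometry:

```python
import torch
import torch.nn.functional as F

def correspondence_constrained_attention(q, k, v, corr, penalty=-1e4):
    """Hypothetical sketch of attention restricted to corresponding pixels.

    q, k, v: (batch, heads, n, dim) -- queries from the view being edited,
             keys/values from a reference view.
    corr:    (n, n) boolean matrix, True where pixel i of the edited view
             matches pixel j of the reference view (assumed precomputed).
    """
    # Additive bias: non-corresponding pairs get a large negative logit,
    # so each pixel aggregates information mainly from its matches.
    bias = torch.zeros(corr.shape, dtype=q.dtype, device=q.device)
    bias = bias.masked_fill(~corr, penalty)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
```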
- Image Editing As Programs with Diffusion Models [69.05164729625052]
We introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions.
arXiv Detail & Related papers (2025-06-04T16:57:24Z)
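A toy sketch of the "editing as a program of atomic operations" idea described above; the operation vocabulary and the state-threading scheme here are invented for illustration and may differ from IEAP's actual design:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class AtomicOp:
    """One step of an editing program; `name` and `args` are illustrative."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)

def run_program(state: Dict[str, Any], program: List[AtomicOp],
                registry: Dict[str, Callable]) -> Dict[str, Any]:
    """Execute atomic ops in order; `state` carries the image plus any
    intermediate products (masks, layouts) between steps."""
    for op in program:
        state = registry[op.name](state, **op.args)
    return state

# A hypothetical decomposition of "replace the cat with a corgi":
program = [
    AtomicOp("locate", {"target": "cat"}),       # produce a region mask
    AtomicOp("remove", {"region": "located"}),   # inpaint the region away
    AtomicOp("insert", {"object": "corgi", "region": "located"}),
]
```

Each registry entry would typically wrap a diffusion-backed handler; the value of the decomposition is that each atomic step is simple enough to execute reliably.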
- Improving Editability in Image Generation with Layer-wise Memory [23.004027029130953]
Current editing approaches, primarily designed for single-object modifications, struggle with sequential editing. We propose enabling rough mask inputs that preserve existing content while naturally integrating new elements. Our framework achieves this through layer-wise memory, which stores latent representations and prompt embeddings from previous edits.
arXiv Detail & Related papers (2025-05-02T07:36:49Z)
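A minimal sketch of what such a layer-wise memory could look like, assuming latent-space compositing per edit round; the class and method names are assumptions, not the paper's API:

```python
import torch
from typing import List, Tuple

class LayerwiseMemory:
    """Each edit round caches its latent, prompt embedding, and mask so
    later rounds can recomposite earlier content instead of regenerating it."""

    def __init__(self) -> None:
        self.layers: List[Tuple[torch.Tensor, torch.Tensor, torch.Tensor]] = []

    def commit(self, latent: torch.Tensor, prompt_emb: torch.Tensor,
               mask: torch.Tensor) -> None:
        # One entry per edit: what was generated, with which prompt, where.
        self.layers.append((latent.detach(), prompt_emb.detach(), mask))

    def composite(self, background: torch.Tensor) -> torch.Tensor:
        # Paste each stored edit back in chronological order, so content
        # from earlier rounds is preserved rather than re-synthesized.
        out = background
        for latent, _, mask in self.layers:
            out = mask * latent + (1.0 - mask) * out
        return out
```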
- Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model [60.82962950960996]
We introduce UnifyEdit, a tuning-free method that performs diffusion latent optimization. We develop two attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity, and a cross-attention (CA) alignment constraint to enhance text alignment. Our approach achieves a robust balance between structure preservation and text alignment across various editing tasks, outperforming other state-of-the-art methods.
arXiv Detail & Related papers (2025-04-08T01:02:50Z)
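One plausible form of the two constraints (not UnifyEdit's exact objective): an L2 term tying the edit pass's self-attention to the source pass's, plus a term concentrating the target token's cross-attention mass inside the edit region. The diffusion latent would then be updated by gradient descent on this loss at each denoising step.

```python
import torch

def unified_constraint_loss(sa_edit: torch.Tensor, sa_src: torch.Tensor,
                            ca_edit: torch.Tensor, edit_mask: torch.Tensor,
                            lambda_sa: float = 1.0,
                            lambda_ca: float = 1.0) -> torch.Tensor:
    """sa_edit / sa_src: (heads, n, n) self-attention maps of the edit and
    source passes; ca_edit: (heads, n) cross-attention of vision tokens to
    the target text token; edit_mask: (n,) in {0, 1}. Shapes are assumed."""
    # SA preservation: keep the edit pass's spatial structure close to
    # the source pass's self-attention.
    loss_sa = (sa_edit - sa_src).pow(2).mean()
    # CA alignment: push the target token's cross-attention mass into
    # the edit region.
    inside = (ca_edit * edit_mask).sum(dim=-1)
    loss_ca = -torch.log(inside / (ca_edit.sum(dim=-1) + 1e-8) + 1e-8).mean()
    return lambda_sa * loss_sa + lambda_ca * loss_ca
```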
- MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation [55.101611012677616]
Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks. We present MAKIMA, a tuning-free multi-attribute editing (MAE) framework built upon pretrained T2I models for open-domain video editing.
arXiv Detail & Related papers (2024-12-28T02:36:51Z)
- Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing [60.730661748555214]
We introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks.
TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability.
arXiv Detail & Related papers (2024-08-23T22:16:34Z)
- Consolidating Attention Features for Multi-view Image Editing [126.19731971010475]
We focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views.
We introduce QNeRF, a neural radiance field trained on the internal query features of the edited images.
We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps.
arXiv Detail & Related papers (2024-02-22T18:50:18Z)