EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
- URL: http://arxiv.org/abs/2512.11715v1
- Date: Fri, 12 Dec 2025 16:51:19 GMT
- Title: EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
- Authors: Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, Songhua Liu
- Abstract summary: Masked Generative Transformers (MGTs) exhibit a localized decoding paradigm that endows them with the inherent capacity to preserve non-relevant regions during the editing process. We introduce the first MGT-based image editing framework, termed EditMGT. We demonstrate that EditMGT's cross-attention maps provide informative signals for localizing edit-relevant regions. We also introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits.
- Score: 84.7089707244905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than performing holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that the MGT's cross-attention maps provide informative signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similar performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.
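To make the two mechanisms concrete, below is a minimal sketch of multi-layer attention consolidation followed by one region-hold decoding step. All function names, tensor shapes, and the threshold `tau` are illustrative assumptions, not the paper's implementation.

```python
import torch

def consolidate_attention(per_layer_maps: torch.Tensor) -> torch.Tensor:
    """Fuse cross-attention maps from several transformer layers into one
    localization map. per_layer_maps: [L, N] attention of N image tokens
    to the edit-relevant text tokens, one row per layer (shape assumed)."""
    lo = per_layer_maps.min(dim=1, keepdim=True).values
    hi = per_layer_maps.max(dim=1, keepdim=True).values
    normed = (per_layer_maps - lo) / (hi - lo + 1e-8)  # per-layer min-max norm
    return normed.mean(dim=0)                          # [N] consolidated map

@torch.no_grad()
def region_hold_step(src_tokens, logits, attn, tau=0.35):
    """One decoding step with region-hold sampling: tokens in low-attention
    areas are held at their source values; only high-attention tokens may
    flip. `tau` is an assumed cut-off on the consolidated map."""
    probs = torch.softmax(logits, dim=-1)               # [N, V]
    proposal = torch.multinomial(probs, 1).squeeze(-1)  # [N] sampled tokens
    editable = attn >= tau                              # edit-relevant region
    return torch.where(editable, proposal, src_tokens)

# Toy usage: 4 layers, 16 image tokens, vocabulary of 8 codes.
attn = consolidate_attention(torch.rand(4, 16))
tokens = region_hold_step(torch.randint(0, 8, (16,)), torch.randn(16, 8), attn)
```

The key property lies in the final `torch.where`: tokens outside the high-attention region can never flip, which is what confines edits to the target area while leaving the rest of the image byte-identical.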
Related papers
- Generative Visual Chain-of-Thought for Image Editing [48.64933075232273]
Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT). GVCoT performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit.
arXiv Detail & Related papers (2026-03-02T14:12:52Z)
- CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning [15.375069717719157]
We present a post-training framework for Content-Consistent Editing (CoCoEdit). We first augment existing editing datasets with refined instructions and masks, from which 40K diverse and high-quality samples are curated as a training set. We then introduce a pixel-level similarity reward to complement MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process.
arXiv Detail & Related papers (2026-02-15T09:36:54Z)
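As a rough illustration of a pixel-level similarity reward of the kind described above, the sketch below scores how well an edit preserves pixels outside the permitted region; the shapes, value ranges, and the weighting against an MLLM reward are assumptions.

```python
import torch

def pixel_consistency_reward(edited: torch.Tensor, source: torch.Tensor,
                             edit_mask: torch.Tensor) -> torch.Tensor:
    """Reward in [0, 1] for leaving non-edit pixels untouched.
    edited, source: [C, H, W] images in [0, 1]; edit_mask: [H, W] bool,
    True where the instruction permits changes (shapes assumed)."""
    keep = ~edit_mask                    # pixels that must stay consistent
    if keep.sum() == 0:                  # degenerate mask: whole image editable
        return torch.tensor(1.0)
    per_pixel_err = (edited - source).abs().mean(dim=0)  # [H, W]
    return 1.0 - per_pixel_err[keep].mean()

# A combined objective might weight it against an MLLM judge score,
# e.g. total = mllm_score + 0.5 * pixel_reward  (weight is an assumption).
```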
- Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control [52.87568958372421]
Follow-Your-Shape is a training-free and mask-free framework that supports precise and controllable editing of object shapes. We compute a Trajectory Divergence Map (TDM) by comparing token-wise velocity differences between the inversion and denoising paths. Our method achieves superior editability and visual fidelity, particularly in tasks requiring large-scale shape replacement.
arXiv Detail & Related papers (2025-08-11T16:10:00Z)
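A minimal sketch of the trajectory-divergence idea, under assumed shapes: per-token velocities from the inversion and denoising paths are compared and averaged over steps. The aggregation and threshold below are illustrative choices, not the paper's.

```python
import torch

def trajectory_divergence_map(v_inversion: torch.Tensor,
                              v_denoising: torch.Tensor) -> torch.Tensor:
    """Token-wise divergence between two velocity trajectories.
    Both inputs: [T, N, D] velocities over T steps for N tokens of
    dimension D (shapes assumed). Returns [N]; high values mark tokens
    the edit actually needs to move."""
    step_gap = (v_inversion - v_denoising).norm(dim=-1)  # [T, N]
    return step_gap.mean(dim=0)                          # aggregate over steps

tdm = trajectory_divergence_map(torch.randn(10, 64, 32), torch.randn(10, 64, 32))
edit_region = tdm > tdm.mean()   # crude threshold, purely illustrative
```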
- X-Edit: Detecting and Localizing Edits in Images Altered by Text-Guided Diffusion Models [3.610796534465868]
Experimental results demonstrate that X-Edit accurately localizes edits in images altered by text-guided diffusion models. This highlights X-Edit's potential as a robust forensic tool for detecting and pinpointing manipulations introduced by advanced image editing techniques.
arXiv Detail & Related papers (2025-05-16T23:29:38Z)
- DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics [71.78350994830885]
We present a novel approach to improving text-guided image editing using diffusion-based models. Our method uses visual and textual self-attention to enhance the cross-attention map, which can serve as regional cues to improve editing performance. To fully compare our method with other DiT-based approaches, we construct the RW-800 benchmark, featuring high-resolution images, long descriptive texts, real-world images, and a new text editing task.
arXiv Detail & Related papers (2025-03-21T02:14:03Z)
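One plausible reading of that enhancement step, sketched under assumptions: the cross-attention map is propagated through row-stochastic visual and textual self-attention matrices so that similar pixels and similar tokens share attention mass. The propagation form and normalization are guesses for illustration, not DCEdit's actual method.

```python
import torch

def refine_cross_attention(cross: torch.Tensor, vis_self: torch.Tensor,
                           txt_self: torch.Tensor) -> torch.Tensor:
    """Sharpen a cross-attention map via self-attention propagation.
    cross: [P, T] pixel-to-token attention; vis_self: [P, P] and
    txt_self: [T, T] row-stochastic self-attention (shapes assumed)."""
    refined = vis_self @ cross @ txt_self   # smooth over similar pixels/tokens
    return refined / (refined.sum(dim=0, keepdim=True) + 1e-8)  # per-token norm

probs = refine_cross_attention(torch.rand(16, 6),
                               torch.softmax(torch.randn(16, 16), -1),
                               torch.softmax(torch.randn(6, 6), -1))
```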
- MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance [28.212908146852197]
We develop MAG-Edit, a training-free, inference-stage optimization method that enables localized image editing in complex scenarios.
In particular, MAG-Edit optimizes the noise latent feature in diffusion models by maximizing two mask-based cross-attention constraints.
arXiv Detail & Related papers (2023-12-18T17:55:44Z)
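To illustrate what one such constraint could look like: maximize the share of the edit token's cross-attention that falls inside the user mask by taking gradient steps on the latent. The single ratio form (the paper uses two constraints), the attention stand-in, and the step size below are all assumptions.

```python
import torch

def masked_attention_ratio(attn: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Share of cross-attention mass that lands inside the edit mask.
    attn: [N] attention of N spatial positions to the edit token;
    mask: [N] bool (shapes assumed)."""
    return attn[mask].sum() / (attn.sum() + 1e-8)

# One illustrative inference-time update of the noise latent.
latent = torch.randn(64, requires_grad=True)
mask = torch.zeros(64, dtype=torch.bool)
mask[20:30] = True                          # hypothetical edit region
attn = torch.softmax(latent, dim=0)         # stand-in for the model's map
loss = -masked_attention_ratio(attn, mask)  # negate => gradient ascent
loss.backward()
with torch.no_grad():
    latent -= 0.1 * latent.grad             # step size is an assumption
```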
- LIME: Localized Image Editing via Attention Regularization in Diffusion Models [69.33072075580483]
This paper introduces LIME for localized image editing in diffusion models. LIME does not require user-specified regions of interest (RoI) or additional text input, but rather employs features from pre-trained methods and a straightforward clustering method to obtain a precise editing mask. We propose a novel cross-attention regularization technique that penalizes unrelated cross-attention scores in the RoI during the denoising steps, ensuring localized edits.
arXiv Detail & Related papers (2023-12-14T18:59:59Z)
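A minimal sketch of this style of regularization, with assumed shapes and penalty value: inside the RoI, logits of text tokens unrelated to the edit are pushed down before the softmax, so the related tokens dominate there.

```python
import torch

def regularize_cross_attention(scores: torch.Tensor, roi: torch.Tensor,
                               edit_token_ids: torch.Tensor,
                               alpha: float = 10.0) -> torch.Tensor:
    """Suppress unrelated text tokens inside the editing region.
    scores: [P, T] pre-softmax cross-attention logits for P pixels and
    T text tokens; roi: [P] bool mask; edit_token_ids: indices of the
    tokens describing the edit (all shapes assumed)."""
    unrelated = torch.ones(scores.shape[1], dtype=torch.bool)
    unrelated[edit_token_ids] = False
    penalized = scores.clone()
    penalized[roi.unsqueeze(1) & unrelated.unsqueeze(0)] -= alpha
    return torch.softmax(penalized, dim=-1)

probs = regularize_cross_attention(torch.randn(16, 6),
                                   torch.rand(16) > 0.5,
                                   torch.tensor([2, 3]))
```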
- Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training [61.984277261016146]
We propose a CustomNeRF model that accepts either a text description or a reference image as the editing prompt.
To tackle the first challenge, we propose a Local-Global Iterative Editing (LGIE) training scheme that alternates between foreground region editing and full-image editing.
For the second challenge, we also design a class-guided regularization that exploits class priors within the generation model to alleviate the inconsistency problem.
arXiv Detail & Related papers (2023-12-04T06:25:06Z)