Group Relative Attention Guidance for Image Editing
- URL: http://arxiv.org/abs/2510.24657v1
- Date: Tue, 28 Oct 2025 17:22:44 GMT
- Title: Group Relative Attention Guidance for Image Editing
- Authors: Xuanpu Zhang, Xuesong Niu, Ruidong Chen, Dan Song, Jianhao Zeng, Penghui Du, Haoxiang Cao, Kai Wu, An-an Liu
- Abstract summary: Group Relative Attention Guidance (GRAG) is a simple yet effective method that modulates the focus of the model on the input image relative to the editing instruction. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.
- Score: 38.299491082179905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, image editing based on Diffusion-in-Transformer models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance, a simple yet effective method that reweights the delta values of different tokens to modulate the focus of the model on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments conducted on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code, consistently enhancing editing quality. Moreover, compared to the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.
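The abstract describes the core mechanism in enough detail to sketch it: each Query/Key token in an MM-Attention layer is decomposed into a shared, layer-dependent bias plus a content-specific delta, and GRAG rescales the delta to control editing intensity. The paper does not specify how the bias is extracted; estimating it as the mean over the token group, and the function name `grag_reweight`, are assumptions made here for illustration only.

```python
import numpy as np

def grag_reweight(tokens: np.ndarray, scale: float) -> np.ndarray:
    """Illustrative sketch of Group Relative Attention Guidance (GRAG).

    tokens: (num_tokens, dim) Query or Key tokens from one MM-Attention layer.
    scale:  guidance weight; 1.0 leaves the tokens unchanged, values above 1
            amplify the content-specific editing signal relative to the bias.

    Assumption: the shared, layer-dependent bias is approximated by the mean
    over the token group; the paper's exact extraction may differ.
    """
    bias = tokens.mean(axis=0, keepdims=True)  # shared bias vector (1, dim)
    delta = tokens - bias                      # content-specific deltas
    return bias + scale * delta                # reweighted tokens
```

Note that the reweighting preserves the group mean for any `scale`, so the model's inherent (bias-driven) behavior is untouched while the per-token editing signal is continuously scaled, consistent with the claimed few-line integration into existing attention code.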
Related papers
- SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control [50.76070785417023]
We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.
arXiv Detail & Related papers (2025-11-12T20:21:37Z)
- Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing [76.44219733285898]
Kontinuous Kontext is an instruction-driven editing model that provides a new dimension of control over edit strength. A lightweight projector network maps the input scalar and the edit instruction to coefficients in the model's modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models.
arXiv Detail & Related papers (2025-10-09T17:51:03Z)
- SAEdit: Token-level control for continuous image editing via Sparse AutoEncoder [52.754326452329956]
We introduce a method for disentangled and continuous editing through token-level manipulation of text embeddings. The edits are applied by manipulating the embeddings along carefully chosen directions, which control the strength of the target attribute. Our method operates directly on text embeddings without modifying the diffusion process, making it model agnostic and broadly applicable to various image backbones.
arXiv Detail & Related papers (2025-10-06T17:51:04Z)
- Visual Autoregressive Modeling for Instruction-Guided Image Editing [97.04821896251681]
We present a visual autoregressive framework that reframes image editing as a next-scale prediction problem. VarEdit generates multi-scale target features to achieve precise edits. It completes a $512\times512$ editing in 1.2 seconds, making it 2.2$\times$ faster than the similarly sized UltraEdit.
arXiv Detail & Related papers (2025-08-21T17:59:32Z)
- Image Editing As Programs with Diffusion Models [69.05164729625052]
We introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions.
arXiv Detail & Related papers (2025-06-04T16:57:24Z)
- DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing [66.43179841884098]
Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years.
We propose DiffEditor to rectify two weaknesses in existing diffusion-based image editing.
Our method can efficiently achieve state-of-the-art performance on various fine-grained image editing tasks.
arXiv Detail & Related papers (2024-02-04T18:50:29Z)
- InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing [27.661609140918916]
InFusion is a framework for zero-shot text-based video editing.
It supports editing of multiple concepts with pixel-level control over diverse concepts mentioned in the editing prompt.
Our framework is a low-cost alternative to one-shot tuned models for editing since it does not require training.
arXiv Detail & Related papers (2023-07-22T17:05:47Z)
- LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance [0.0]
LEDITS is a combined lightweight approach for real-image editing, incorporating the Edit Friendly DDPM inversion technique with Semantic Guidance.
This approach achieves versatile edits, both subtle and extensive, as well as alterations in composition and style, while requiring no optimization and no extensions to the architecture.
arXiv Detail & Related papers (2023-07-02T09:11:09Z)
- EditGAN: High-Precision Semantic Image Editing [120.49401527771067]
EditGAN is a novel method for high-quality, high-precision semantic image editing.
We show that EditGAN can manipulate images with an unprecedented level of detail and freedom.
We can also easily combine multiple edits and perform plausible edits beyond EditGAN training data.
arXiv Detail & Related papers (2021-11-04T22:36:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.