Disentangling Instruction Influence in Diffusion Transformers for Parallel Multi-Instruction-Guided Image Editing
- URL: http://arxiv.org/abs/2504.04784v1
- Date: Mon, 07 Apr 2025 07:26:25 GMT
- Title: Disentangling Instruction Influence in Diffusion Transformers for Parallel Multi-Instruction-Guided Image Editing
- Authors: Hui Liu, Bin Zou, Suiyun Zhang, Kecheng Chen, Rui Liu, Haoliang Li
- Abstract summary: Instruction Influence Disentanglement (IID) is a novel framework enabling parallel execution of multiple instructions in a single denoising process. We analyze self-attention mechanisms in DiTs and derive instruction-specific attention masks to disentangle each instruction's influence. IID reduces diffusion steps while improving fidelity and instruction completion compared to existing baselines.
- Score: 26.02149948089938
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction-guided image editing enables users to specify modifications using natural language, offering more flexibility and control. Among existing frameworks, Diffusion Transformers (DiTs) outperform U-Net-based diffusion models in scalability and performance. However, while real-world scenarios often require concurrent execution of multiple instructions, step-by-step editing suffers from accumulated errors and degraded quality, and integrating multiple instructions with a single prompt usually results in incomplete edits due to instruction conflicts. We propose Instruction Influence Disentanglement (IID), a novel framework enabling parallel execution of multiple instructions in a single denoising process, designed for DiT-based models. By analyzing self-attention mechanisms in DiTs, we identify distinctive attention patterns in multi-instruction settings and derive instruction-specific attention masks to disentangle each instruction's influence. These masks guide the editing process to ensure localized modifications while preserving consistency in non-edited regions. Extensive experiments on open-source and custom datasets demonstrate that IID reduces diffusion steps while improving fidelity and instruction completion compared to existing baselines. The codes will be publicly released upon the acceptance of the paper.
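The code has not been released, so the following is a minimal sketch of the core idea as stated in the abstract: a joint image/text attention layer in which each instruction's tokens are restricted to the image tokens of its own edit region, so that parallel instructions do not interfere and non-edited regions stay untouched. The `build_instruction_masks` helper, the span bookkeeping, and the fact that edit regions are passed in directly are illustrative assumptions; in the paper the masks are derived from the DiT's own self-attention patterns.

```python
# Minimal sketch of instruction-specific attention masking for parallel
# multi-instruction editing in a DiT-style joint attention layer.
# Shapes, span bookkeeping, and mask derivation are illustrative assumptions,
# not the authors' released implementation.
import torch
import torch.nn.functional as F


def build_instruction_masks(num_image_tokens, instruction_spans, edit_regions):
    """Build a boolean attention mask over the joint [image | text] sequence.

    instruction_spans: list of (start, end) token indices of each instruction
                       within the text segment (end exclusive).
    edit_regions:      list of boolean tensors, each of shape
                       (num_image_tokens,), True where the corresponding
                       instruction is allowed to modify the image.
    Returns a (seq_len, seq_len) mask where True means attention is allowed.
    """
    num_text_tokens = max(end for _, end in instruction_spans)
    seq_len = num_image_tokens + num_text_tokens
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)

    # Image tokens always attend to all image tokens, preserving global layout
    # and consistency in non-edited regions.
    mask[:num_image_tokens, :num_image_tokens] = True

    for (start, end), region in zip(instruction_spans, edit_regions):
        t0, t1 = num_image_tokens + start, num_image_tokens + end
        # Each instruction attends among its own tokens only.
        mask[t0:t1, t0:t1] = True
        # Image-text interaction is restricted to that instruction's edit
        # region, disentangling its influence from the other instructions.
        mask[t0:t1, :num_image_tokens] = region
        mask[:num_image_tokens, t0:t1] = region.unsqueeze(1)
    return mask


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with the disentangling mask applied."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

The sketch supplies edit regions explicitly to stay self-contained; the paper instead identifies them from attention patterns observed during denoising and uses them to guide a single parallel denoising pass.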
Related papers
- Training-Free Text-Guided Image Editing with Visual Autoregressive Model [46.201510044410995]
We propose a novel text-guided image editing framework based on Visual AutoRegressive modeling.
Our method eliminates the need for explicit inversion while ensuring precise and controlled modifications.
Our framework operates in a training-free manner and achieves high-fidelity editing with faster inference speeds.
arXiv Detail & Related papers (2025-03-31T09:46:56Z) - FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model [54.693572837423226]
FireEdit is an innovative Fine-grained Instruction-based image editing framework that exploits a REgion-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. Our approach surpasses the state-of-the-art instruction-based image editing methods.
arXiv Detail & Related papers (2025-03-25T16:59:42Z) - MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation [55.101611012677616]
Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks.
We present MAKIMA, a tuning-free multi-attribute editing (MAE) framework built upon pretrained T2I models for open-domain video editing.
arXiv Detail & Related papers (2024-12-28T02:36:51Z) - UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency [69.33072075580483]
We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training.
Our method addresses these challenges by introducing a novel editing mechanism called Cycle Edit Consistency (CEC).
CEC applies forward and backward edits in one training step and enforces consistency in image and attention spaces.
arXiv Detail & Related papers (2024-12-19T18:59:58Z) - Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing [60.730661748555214]
We introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks.
TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability.
arXiv Detail & Related papers (2024-08-23T22:16:34Z) - Specify and Edit: Overcoming Ambiguity in Text-Based Image Editing [24.316956641791034]
We propose a zero-shot inference pipeline for diffusion-based editing systems.
We use a large language model (LLM) to decompose the input instruction into specific instructions.
Our pipeline improves the interpretability of editing models, and boosts the output diversity.
arXiv Detail & Related papers (2024-07-29T17:59:57Z) - InstructEdit: Instruction-based Knowledge Editing for Large Language Models [39.2147118489123]
We develop an instruction-based editing technique, termed InstructEdit, which facilitates the editor's adaptation to various tasks simultaneously using simple instructions.
Experiments on held-out unseen tasks illustrate that InstructEdit consistently surpasses previous strong baselines.
arXiv Detail & Related papers (2024-02-25T15:46:33Z) - DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism [17.03837136771538]
We propose DECap, a diffusion-based explicit caption editing method.
We reformulate the ECE task as a denoising process under the diffusion mechanism.
The denoising process involves the explicit predictions of edit operations and corresponding content words.
arXiv Detail & Related papers (2023-11-25T03:52:03Z) - Guiding Instruction-based Image Editing via Multimodal Large Language Models [102.82211398699644]
Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation.
We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE).
MGIE learns to derive expressive instructions and provides explicit guidance.
arXiv Detail & Related papers (2023-09-29T10:01:50Z) - Exploring Format Consistency for Instruction Tuning [79.0698403613366]
In this work, we propose a framework named Unified Instruction Tuning (UIT).
UIT calls OpenAI APIs for automatic format transfer among different instruction tuning datasets such as PromptSource, FLAN and CrossFit.
With the framework, we (1) demonstrate the necessity of maintaining format consistency in instruction tuning; (2) improve the generalization performance on unseen instructions with T5-LM-xl; and (3) provide a novel perplexity-based denoising method to reduce the noise of automatic format transfer (a rough sketch follows at the end of this list).
arXiv Detail & Related papers (2023-07-28T12:00:13Z)
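Since the UIT entry above relies on perplexity-based denoising of automatically reformatted instructions, here is a rough sketch of what such a filter could look like, assuming GPT-2 from Hugging Face `transformers` as the scoring model; the scoring model, the fixed threshold, and the `denoise` helper are illustrative assumptions rather than the paper's configuration.

```python
# Rough sketch of perplexity-based filtering of automatically reformatted
# instructions. GPT-2 as the scoring model and the fixed threshold are
# illustrative assumptions, not the UIT paper's actual setup.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()


@torch.no_grad()
def perplexity(text: str) -> float:
    """Per-token perplexity of `text` under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # With labels == input_ids the model returns the mean cross-entropy loss.
    loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()


def denoise(converted_instructions, threshold=80.0):
    """Keep only conversions whose perplexity stays below the threshold."""
    return [s for s in converted_instructions if perplexity(s) < threshold]


# A fluent conversion is kept; a garbled one is likely to be dropped.
kept = denoise([
    "Rewrite the following sentence in formal English.",
    "following the rewrite sentence English: formal in",
])
```

The underlying assumption is simply that garbled conversions receive much higher perplexity than fluent ones, so a threshold removes most of the noise introduced by automatic format transfer.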