HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads
- URL: http://arxiv.org/abs/2411.15034v1
- Date: Fri, 22 Nov 2024 16:08:03 GMT
- Title: HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads
- Authors: Yu Xu, Fan Tang, Juan Cao, Yuxin Zhang, Xiaoyu Kong, Jintao Li, Oliver Deussen, Tong-Yee Lee
- Abstract summary: HeadRouter is a training-free image editing framework that edits the source image by adaptively routing text guidance to different attention heads in MM-DiTs.
We present a dual-token refinement module to refine text/image token representations for precise semantic guidance and accurate region expression.
- Score: 39.94688771600168
- Abstract: Diffusion Transformers (DiTs) have exhibited robust capabilities in image generation tasks. However, accurate text-guided image editing for multimodal DiTs (MM-DiTs) still poses a significant challenge. Unlike UNet-based structures, which can utilize self/cross-attention maps for semantic editing, MM-DiTs inherently lack support for explicitly and consistently incorporated text guidance, resulting in semantic misalignment between the edited results and the texts. In this study, we disclose the sensitivity of different attention heads to different image semantics within MM-DiTs and introduce HeadRouter, a training-free image editing framework that edits the source image by adaptively routing text guidance to different attention heads in MM-DiTs. Furthermore, we present a dual-token refinement module to refine text/image token representations for precise semantic guidance and accurate region expression. Experimental results on multiple benchmarks demonstrate HeadRouter's performance in terms of editing fidelity and image quality.
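To make the routing idea concrete, here is a minimal sketch, assuming access to an MM-DiT block's joint attention probabilities. The function name, the attention-mass sensitivity heuristic, and the softmax routing are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def route_text_guidance(attn_probs, text_token_ids, temperature=1.0):
    """
    Hypothetical per-head routing: weight each attention head's text
    guidance by how strongly it attends to the edit-relevant text tokens.

    attn_probs:     (batch, heads, query_len, key_len) joint attention maps
    text_token_ids: key positions holding the edit-relevant text tokens
    """
    # Sensitivity of each head = mean attention mass on the edit tokens.
    sensitivity = attn_probs[..., text_token_ids].mean(dim=(-1, -2))  # (B, H)
    # Softmax over heads turns sensitivities into routing weights.
    routing = torch.softmax(sensitivity / temperature, dim=-1)        # (B, H)
    # Rescale each head's attention to the edit tokens by its weight.
    # (A real implementation would renormalize the attention rows.)
    scaled = attn_probs.clone()
    scaled[..., text_token_ids] *= routing[:, :, None, None]
    return scaled
```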
Related papers
- Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing [4.948910649137149]
Diffusion Transformers (DiTs) have recently achieved remarkable success in text-guided image generation.
We show how multimodal information collectively forms this joint latent space and how it guides the semantics of the synthesized images.
We propose a simple yet effective Encode-Identify-Manipulate (EIM) framework for zero-shot fine-grained image editing (a toy sketch of the idea follows this entry).
arXiv Detail & Related papers (2024-11-12T21:34:30Z)
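A toy reading of the Encode-Identify-Manipulate recipe; every name here (`encode`, `decode`, the direction-from-prompt-latents trick) is a hypothetical stand-in rather than the paper's actual components:

```python
import torch

def eim_edit(encode, decode, image, src_prompt_latent, edit_prompt_latent, alpha=1.2):
    """Toy Encode-Identify-Manipulate loop: encode the image, identify a
    semantic direction as the difference between two prompt latents,
    shift the image latent along it, and decode."""
    h = encode(image)                              # Encode
    d = edit_prompt_latent - src_prompt_latent     # Identify: semantic direction
    d = d / (d.norm() + 1e-8)
    return decode(h + alpha * d)                   # Manipulate, then decode
```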
- Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing [4.948910649137149]
Diffusion Transformers (DiTs) have achieved remarkable success in diverse and high-quality text-to-image (T2I) generation.
We investigate how text and image latents individually and jointly contribute to the semantics of generated images.
We propose a simple and effective Extract-Manipulate-Sample framework for zero-shot fine-grained image editing.
arXiv Detail & Related papers (2024-08-23T19:00:52Z)
- Exploring Text-Guided Single Image Editing for Remote Sensing Images [30.23541304590692]
This paper proposes a text-guided remote sensing image (RSI) editing method that is both controllable and stable, and can be trained using only a single image.
It adopts a multi-scale training approach to preserve consistency without the need for training on extensive benchmark datasets.
arXiv Detail & Related papers (2024-05-09T13:45:04Z)
- DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images [55.546024767130994]
We propose a novel model to enhance the text-based control of an image editor by explicitly reasoning about which parts of the image to alter or preserve.
It relies on word alignments between a description of the original source image and the instruction that reflects the needed updates, together with the input image.
It is evaluated on a subset of the Bison dataset and a self-defined dataset dubbed Dream.
arXiv Detail & Related papers (2024-04-27T22:45:47Z)
- Guiding Instruction-based Image Editing via Multimodal Large Language Models [102.82211398699644]
Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation.
We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE).
MGIE learns to derive expressive instructions and provides explicit guidance.
arXiv Detail & Related papers (2023-09-29T10:01:50Z)
- TD-GEM: Text-Driven Garment Editing Mapper [15.121103742607383]
We propose a Text-Driven Garment Editing Mapper (TD-GEM) to edit fashion items in a disentangled way.
Optimization based on Contrastive Language-Image Pre-training (CLIP) is then utilized to guide the latent representation of a fashion image (a generic sketch of such a loop follows this entry).
Our TD-GEM manipulates the image accurately according to the target attribute expressed in terms of a text prompt.
arXiv Detail & Related papers (2023-05-29T14:31:54Z)
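TD-GEM's own mapper is not reproduced here; the following is a generic CLIP-guided latent optimization loop in the same spirit, assuming a differentiable `generator` that maps a latent `w` to an image in [-1, 1] (OpenAI's `clip` package is real; `generator` and the hyperparameters are assumptions):

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

def optimize_latent(generator, w, target_text, steps=200, lr=0.05):
    """Generic CLIP-guided latent optimization: nudge latent w so the
    generated image matches the text prompt."""
    device = w.device
    model, _ = clip.load("ViT-B/32", device=device)
    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize([target_text]).to(device))
    w = w.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = (generator(w) + 1) / 2          # map [-1, 1] -> [0, 1]
        img = F.interpolate(img, size=224)    # CLIP input resolution
        # A real run would also apply CLIP's channel normalization here.
        img_feat = model.encode_image(img)
        loss = 1 - F.cosine_similarity(img_feat, text_feat, dim=-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach()
```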
- Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text with the desired one while preserving background and styles of the original text.
Previous methods of editing the whole image have to learn different translation rules of background and text regions simultaneously.
We propose MOSTEL, a novel network that edits by MOdifying Scene Text images at strokE Level.
arXiv Detail & Related papers (2022-12-05T02:10:59Z)
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited (a sketch of this masking idea follows this entry).
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
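DiffEdit's core trick is to contrast the denoiser's noise estimates under the source and edit prompts. A minimal sketch, assuming a diffusers-style conditional UNet (`unet(x, t, encoder_hidden_states=...)`) and a scheduler with `add_noise`; the sample count, channel averaging, and fixed threshold are simplifications of the paper's procedure:

```python
import torch

@torch.no_grad()
def diffedit_mask(unet, scheduler, x0, t, src_emb, edit_emb,
                  n_samples=8, threshold=0.5):
    """DiffEdit-style mask estimation: regions where the noise
    predictions differ most between the source and edit prompts are
    the regions that likely need editing."""
    diffs = []
    for _ in range(n_samples):
        noise = torch.randn_like(x0)
        x_t = scheduler.add_noise(x0, noise, t)   # forward diffusion to step t
        eps_src = unet(x_t, t, encoder_hidden_states=src_emb).sample
        eps_edit = unet(x_t, t, encoder_hidden_states=edit_emb).sample
        diffs.append((eps_src - eps_edit).abs().mean(dim=1))  # avg channels
    diff = torch.stack(diffs).mean(dim=0)
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)  # normalize
    return (diff > threshold).float()             # binary edit mask
```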
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments conducted on the MS-COCO dataset demonstrate the effectiveness of the proposed framework (a toy module sketch follows this entry).
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
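A toy reading of the MTM idea, with layer sizes and the cosine-based modality loss as illustrative assumptions rather than the paper's exact design:

```python
import torch.nn as nn
import torch.nn.functional as F

class ModalityTransition(nn.Module):
    """Toy MTM-style bridge: project visual features into the language
    model's semantic space before decoding."""
    def __init__(self, vis_dim=2048, sem_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, sem_dim), nn.ReLU(),
            nn.Linear(sem_dim, sem_dim),
        )

    def forward(self, vis_feats):          # (B, N, vis_dim)
        return self.proj(vis_feats)        # (B, N, sem_dim)

def modality_loss(pred_sem, target_sem):
    """Stand-in for the proposed modality loss: pull projected visual
    features toward target semantic/sentence embeddings."""
    return 1 - F.cosine_similarity(pred_sem, target_sem, dim=-1).mean()
```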