Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing
- URL: http://arxiv.org/abs/2411.08196v1
- Date: Tue, 12 Nov 2024 21:34:30 GMT
- Title: Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing
- Authors: Zitao Shuai, Chenwei Wu, Zhengxu Tang, Bowen Song, Liyue Shen
- Abstract summary: Diffusion Transformers (DiTs) have recently achieved remarkable success in text-guided image generation.
We show how multimodal information collectively forms this joint space and how it guides the semantics of the synthesized images.
We propose a simple yet effective Encode-Identify-Manipulate (EIM) framework for zero-shot fine-grained image editing.
- Score: 4.948910649137149
- Abstract: Diffusion Transformers (DiTs) have recently achieved remarkable success in text-guided image generation. In image editing, DiTs project text and image inputs to a joint latent space, from which they decode and synthesize new images. However, it remains largely unexplored how multimodal information collectively forms this joint space and how it guides the semantics of the synthesized images. In this paper, we investigate the latent space of DiT models and uncover two key properties: First, DiT's latent space is inherently semantically disentangled, where different semantic attributes can be controlled by specific editing directions. Second, consistent semantic editing requires utilizing the entire joint latent space, as neither the encoded image nor the text alone contains enough semantic information. We show that these editing directions can be obtained directly from text prompts, enabling precise semantic control without additional training or mask annotations. Based on these insights, we propose a simple yet effective Encode-Identify-Manipulate (EIM) framework for zero-shot fine-grained image editing. Specifically, we first encode both the given source image and the text prompt that describes the image to obtain the joint latent embedding. Then, using our proposed Hessian Score Distillation Sampling (HSDS) method, we identify editing directions that control specific target attributes while preserving other image features. These directions are guided by text prompts and used to manipulate the latent embeddings. Moreover, we propose a new metric to quantify the disentanglement degree of the latent space of diffusion models. Extensive experimental results on our newly curated benchmark dataset, together with our analysis, demonstrate DiT's disentanglement properties and the effectiveness of the EIM framework.
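The Encode-Identify-Manipulate pipeline described in the abstract can be sketched in miniature. The toy below is an illustration, not the paper's implementation: real joint latents are high-dimensional DiT activations and the editing direction comes from the proposed HSDS method rather than a raw embedding difference. The functions `editing_direction` and `manipulate`, and the 4-dimensional embeddings, are all hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def editing_direction(src_text_emb, tgt_text_emb):
    """Identify step (simplified): derive a semantic editing direction
    from source- and target-prompt embeddings as their normalized
    difference (EIM instead uses Hessian Score Distillation Sampling)."""
    return normalize([t - s for s, t in zip(src_text_emb, tgt_text_emb)])

def manipulate(joint_latent, direction, strength=1.0):
    """Manipulate step: move the joint latent along the direction."""
    return [z + strength * d for z, d in zip(joint_latent, direction)]

# Toy 4-dim example: hypothetical embeddings for a source prompt
# ("photo of a cat") and an edit prompt ("photo of a smiling cat").
src = [0.2, 0.1, 0.0, 0.5]
tgt = [0.2, 0.9, 0.0, 0.5]
latent = [1.0, 1.0, 1.0, 1.0]   # stand-in for the joint latent embedding

d = editing_direction(src, tgt)
edited = manipulate(latent, d, strength=0.8)
```

Because the two prompts differ only in one coordinate, the direction is axis-aligned and the edit changes only that coordinate of the latent, leaving the others untouched — a one-dimensional caricature of disentangled editing.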
Related papers
- Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing [4.948910649137149]
Diffusion Transformers (DiTs) have achieved remarkable success in diverse and high-quality text-to-image (T2I) generation.
We investigate how text and image latents individually and jointly contribute to the semantics of generated images.
We propose a simple and effective Extract-Manipulate-Sample framework for zero-shot fine-grained image editing.
arXiv Detail & Related papers (2024-08-23T19:00:52Z) - Zero-shot Inversion Process for Image Attribute Editing with Diffusion Models [9.924851219904843]
We propose a framework that injects a fusion of generated visual reference and text guidance into the semantic latent space of a pre-trained diffusion model.
Using only a tiny neural network, the proposed ZIP produces diverse content and attributes under the intuitive control of the text prompt.
Compared to state-of-the-art methods, ZIP produces images of equivalent quality while providing a realistic editing effect.
arXiv Detail & Related papers (2023-08-30T08:40:15Z) - iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity, CLIP alignment score and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z) - Entity-Level Text-Guided Image Manipulation [70.81648416508867]
We study a novel task on text-guided image manipulation at the entity level in the real world (eL-TGIM).
We propose an elegant framework, dubbed as SeMani, forming the Semantic Manipulation of real-world images.
In the semantic alignment phase, SeMani incorporates a semantic alignment module to locate the entity-relevant region to be manipulated.
In the image manipulation phase, SeMani adopts a generative model to synthesize new images conditioned on the entity-irrelevant regions and target text descriptions.
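SeMani's two-phase decomposition — locate the entity-relevant region, then recompose the image from the untouched background and a text-conditioned synthesis — can be sketched as a toy. Everything here is hypothetical: real relevance scores come from a learned semantic alignment module and the generator is a full generative model, not scalar stand-ins.

```python
def locate_entity(relevance, threshold=0.5):
    """Semantic alignment phase (simplified): mark pixels whose
    hypothetical text-entity relevance score exceeds a threshold."""
    return [1 if r > threshold else 0 for r in relevance]

def manipulate_image(image, generated, mask):
    """Image manipulation phase (simplified): keep entity-irrelevant
    pixels, replace entity-relevant ones with the synthesized content."""
    return [g if m else x for x, g, m in zip(image, generated, mask)]

relevance = [0.1, 0.9, 0.8, 0.2]   # hypothetical alignment scores
image     = [5, 5, 5, 5]           # toy 1-D "source image"
generated = [9, 9, 9, 9]           # toy generator output
mask = locate_entity(relevance)            # -> [0, 1, 1, 0]
edited = manipulate_image(image, generated, mask)  # -> [5, 9, 9, 5]
```

The point of the decomposition is visible even in one dimension: only the located region changes, so text-irrelevant content is preserved by construction.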
arXiv Detail & Related papers (2023-02-22T13:56:23Z) - Towards Arbitrary Text-driven Image Manipulation via Space Alignment [49.3370305074319]
We propose a new Text-driven image Manipulation framework via Space Alignment (TMSA).
TMSA aims to align the same semantic regions in CLIP and StyleGAN spaces.
The framework can support arbitrary image editing modes without additional cost.
arXiv Detail & Related papers (2023-01-25T16:20:01Z) - DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited.
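DiffEdit's mask generation contrasts the diffusion model's noise estimates under two different text conditionings: where the estimates disagree most, the image must change. A minimal sketch of that thresholding step, assuming the two noise estimates are given as flat lists (in the real method they come from a text-conditioned denoiser and the difference is averaged over noise samples):

```python
def diffedit_mask(eps_source, eps_target, threshold=0.5):
    """Contrast two conditional noise estimates and threshold the
    normalized absolute difference to obtain a binary edit mask."""
    diff = [abs(a - b) for a, b in zip(eps_source, eps_target)]
    peak = max(diff) or 1.0          # normalize to [0, 1]
    return [1 if d / peak > threshold else 0 for d in diff]

# Toy 1-D "image": the two conditionings disagree only on pixels 2-3,
# so the mask localizes the edit there.
eps_src = [0.10, 0.12, 0.90, 0.85, 0.11]
eps_tgt = [0.11, 0.10, 0.10, 0.15, 0.12]
mask = diffedit_mask(eps_src, eps_tgt)   # -> [0, 0, 1, 1, 0]
```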
arXiv Detail & Related papers (2022-10-20T17:16:37Z) - Layout-Bridging Text-to-Image Synthesis [20.261873143881573]
We push for effective modeling in both text-to-image generation and layout-to-image synthesis.
We focus on learning the textual-visual semantic alignment per object in the layout to precisely incorporate the input text into the layout-to-image synthesis process.
arXiv Detail & Related papers (2022-08-12T08:21:42Z) - ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation [97.36550187238177]
We study a novel task on text-guided image manipulation on the entity level in the real world.
The task imposes three basic requirements: (1) to edit the entity consistently with the text descriptions, (2) to preserve the text-irrelevant regions, and (3) to merge the manipulated entity into the image naturally.
Our framework incorporates a semantic alignment module to locate the image regions to be manipulated, and a semantic loss to help align vision and language.
arXiv Detail & Related papers (2022-04-09T09:01:19Z) - FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
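FlexIT's iterative transformation toward a target point can be caricatured as gradient descent on a squared-distance objective plus a regularizer pulling the image back toward the source. The sketch below is a stand-in under loud assumptions: `encode` is the identity rather than a CLIP image encoder, and a single quadratic regularizer replaces FlexIT's variety of novel regularization terms.

```python
def encode(image):
    """Hypothetical stand-in for a CLIP image encoder (identity here)."""
    return list(image)

def flexit_step(image, source, target_point, lr=0.2, reg=0.1):
    """One update: descend the gradient of
    ||encode(image) - target||^2 + reg * ||image - source||^2."""
    return [
        x - lr * (2 * (e - t) + reg * 2 * (x - s))
        for x, e, t, s in zip(image, encode(image), target_point, source)
    ]

source = [0.0, 0.0, 0.0]
target = [1.0, -1.0, 0.5]   # toy combined image+text target point
image = list(source)
for _ in range(50):
    image = flexit_step(image, source, target)
```

With these choices the iterates converge to a compromise between target and source (here `target / 1.1`, since the regularizer weight is 0.1), which is the intended behavior: the edit moves toward the target while coherence with the input is retained.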
arXiv Detail & Related papers (2022-03-09T13:34:38Z) - Delta-GAN-Encoder: Encoding Semantic Changes for Explicit Image Editing, using Few Synthetic Samples [2.348633570886661]
We propose a novel method for learning to control any desired attribute in a pre-trained GAN's latent space.
We perform Sim2Real learning, relying on minimal samples to achieve an unlimited number of continuous, precise edits.
arXiv Detail & Related papers (2021-11-16T12:42:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.