Uncovering the Disentanglement Capability in Text-to-Image Diffusion
Models
- URL: http://arxiv.org/abs/2212.08698v1
- Date: Fri, 16 Dec 2022 19:58:52 GMT
- Title: Uncovering the Disentanglement Capability in Text-to-Image Diffusion
Models
- Authors: Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong
Yu, Zhe Lin, Yang Zhang, Shiyu Chang
- Abstract summary: A key desired property of image generative models is the ability to disentangle different attributes.
We propose a simple, light-weight image editing algorithm where the mixing weights of the two text embeddings are optimized for style matching and content preservation.
Experiments show that the proposed method can modify a wide range of attributes, outperforming diffusion-model-based image-editing algorithms that require fine-tuning.
- Score: 60.63556257324894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models have been widely studied in computer vision. Recently,
diffusion models have drawn substantial attention due to the high quality of
their generated images. A key desired property of image generative models is
the ability to disentangle different attributes, which should enable
modification towards a style without changing the semantic content, and the
modification parameters should generalize to different images. Previous studies
have found that generative adversarial networks (GANs) are inherently endowed
with such disentanglement capability, so they can perform disentangled image
editing without re-training or fine-tuning the network. In this work, we
explore whether diffusion models are also inherently equipped with such a
capability. Our finding is that for stable diffusion models, by partially
changing the input text embedding from a neutral description (e.g., "a photo of
person") to one with style (e.g., "a photo of person with smile") while fixing
all the Gaussian random noises introduced during the denoising process, the
generated images can be modified towards the target style without changing the
semantic content. Based on this finding, we further propose a simple,
light-weight image editing algorithm where the mixing weights of the two text
embeddings are optimized for style matching and content preservation. This
entire process only involves optimizing over around 50 parameters and does not
fine-tune the diffusion model itself. Experiments show that the proposed method
can modify a wide range of attributes, with the performance outperforming
diffusion-model-based image-editing algorithms that require fine-tuning. The
optimized weights generalize well to different images. Our code is publicly
available at https://github.com/UCSB-NLP-Chang/DiffusionDisentanglement.
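Below is a minimal sketch of the core finding, written against the Hugging Face diffusers API (an assumption; the paper's released code at the GitHub link above is the reference implementation). The prompts, the seed, and the per-step mixing weights `lam` are illustrative placeholders; in the paper those roughly 50 weights are optimized for style matching and content preservation rather than fixed by hand.

```python
# Sketch: partially shift the text embedding from a neutral to a styled
# description while keeping all Gaussian noise fixed, so only ~num_steps
# scalar mixing weights change and the diffusion model itself is untouched.
# Assumes Stable Diffusion via the `diffusers` library; not the authors' exact code.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # deterministic sampler

def encode(prompt):
    # CLIP text embedding used to condition the UNet
    tokens = pipe.tokenizer(
        prompt, padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    )
    return pipe.text_encoder(tokens.input_ids.to(device))[0]

neutral = encode("a photo of person")             # c0: semantic content
styled = encode("a photo of person with smile")   # c1: target style

num_steps = 50
pipe.scheduler.set_timesteps(num_steps, device=device)

# One mixing weight per denoising step (~50 parameters in total); these are
# the quantities the paper optimizes, here simply initialized to a constant.
lam = torch.full((num_steps,), 0.7, device=device, requires_grad=True)

# Fix all randomness: the same initial latent is reused for every forward pass.
generator = torch.Generator(device=device).manual_seed(0)
latents = torch.randn(
    (1, pipe.unet.config.in_channels, 64, 64),
    generator=generator, device=device,
) * pipe.scheduler.init_noise_sigma

with torch.no_grad():  # drop no_grad when back-propagating a loss into lam
    for i, t in enumerate(pipe.scheduler.timesteps):
        # Partially change the conditioning from the neutral to the styled description.
        cond = (1 - lam[i]) * neutral + lam[i] * styled
        model_in = pipe.scheduler.scale_model_input(latents, t)
        noise_pred = pipe.unet(model_in, t, encoder_hidden_states=cond).sample
        latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

# Decoded image tensor in [-1, 1]; in the paper, lam would be optimized with a
# style-matching loss (e.g., CLIP similarity to the styled prompt) plus a
# content-preservation loss against the image from the neutral embedding alone.
image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
```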
Related papers
- Conditional Diffusion on Web-Scale Image Pairs leads to Diverse Image Variations [32.892042877725125]
Current image variation techniques involve adapting a text-to-image model to reconstruct an input image conditioned on the same image.
We show that a diffusion model trained to reconstruct an input image from frozen embeddings can reconstruct the image with minor variations.
We propose a new pretraining strategy to generate image variations using a large collection of image pairs.
arXiv Detail & Related papers (2024-05-23T17:58:03Z) - Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention
Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experimental results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z) - Aligning Text-to-Image Diffusion Models with Reward Backpropagation [62.45086888512723]
We propose AlignProp, a method that aligns diffusion models to downstream reward functions using end-to-end backpropagation of the reward gradient.
We show AlignProp achieves higher rewards in fewer training steps than alternatives, while being conceptually simpler.
arXiv Detail & Related papers (2023-10-05T17:59:18Z) - ProSpect: Prompt Spectrum for Attribute-Aware Personalization of
Diffusion Models [77.03361270726944]
Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models.
We propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low to high frequency information.
We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout.
arXiv Detail & Related papers (2023-05-25T16:32:01Z) - SINE: SINgle Image Editing with Text-to-Image Diffusion Models [10.67527134198167]
This work aims to address the problem of single-image editing.
We propose a novel model-based guidance built upon classifier-free guidance.
We show promising editing capabilities, including changing style, content addition, and object manipulation.
arXiv Detail & Related papers (2022-12-08T18:57:13Z) - eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z) - Diffusion Visual Counterfactual Explanations [51.077318228247925]
Visual Counterfactual Explanations (VCEs) are an important tool to understand the decisions of an image classifier.
Current approaches for the generation of VCEs are restricted to adversarially robust models and often contain non-realistic artefacts.
In this paper, we overcome this by generating Diffusion Visual Counterfactual Explanations (DVCEs) for arbitrary ImageNet classifiers.
arXiv Detail & Related papers (2022-10-21T09:35:47Z) - Encoding Robustness to Image Style via Adversarial Feature Perturbations [72.81911076841408]
We adapt adversarial training by directly perturbing feature statistics, rather than image pixels, to produce robust models.
Our proposed method, Adversarial Batch Normalization (AdvBN), is a single network layer that generates worst-case feature perturbations during training.
arXiv Detail & Related papers (2020-09-18T17:52:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.