Localizing and Editing Knowledge in Text-to-Image Generative Models
- URL: http://arxiv.org/abs/2310.13730v1
- Date: Fri, 20 Oct 2023 17:31:12 GMT
- Title: Localizing and Editing Knowledge in Text-to-Image Generative Models
- Authors: Samyadeep Basu, Nanxuan Zhao, Vlad Morariu, Soheil Feizi, Varun
Manjunatha
- Abstract summary: Knowledge about different attributes is not localized in isolated components, but is instead distributed amongst a set of components in the conditional UNet.
We introduce Diff-QuickFix, a fast, data-free model editing method that can effectively edit concepts in text-to-image models.
- Score: 62.02776252311559
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-Image Diffusion Models such as Stable-Diffusion and Imagen have
achieved unprecedented quality of photorealism with state-of-the-art FID scores
on MS-COCO and other generation benchmarks. Given a caption, image generation
requires fine-grained knowledge about attributes such as object structure,
style, and viewpoint amongst others. Where does this information reside in
text-to-image generative models? In our paper, we tackle this question and
understand how knowledge corresponding to distinct visual attributes is stored
in large-scale text-to-image diffusion models. We adapt Causal Mediation
Analysis for text-to-image models and trace knowledge about distinct visual
attributes to various (causal) components in the (i) UNet and (ii) text-encoder
of the diffusion model. In particular, we show that unlike generative
large-language models, knowledge about different attributes is not localized in
isolated components, but is instead distributed amongst a set of components in
the conditional UNet. These sets of components are often distinct for different
visual attributes. Remarkably, we find that the CLIP text-encoder in public
text-to-image models such as Stable-Diffusion contains only one causal state
across different visual attributes, and this is the first self-attention layer
corresponding to the last subject token of the attribute in the caption. This
is in stark contrast to the causal states in other language models which are
often the mid-MLP layers. Based on this observation of only one causal state in
the text-encoder, we introduce Diff-QuickFix, a fast, data-free model editing
method that can effectively edit concepts in text-to-image models.
Diff-QuickFix can edit (ablate) concepts in under a second with a closed-form
update, providing a significant 1000x speedup over existing fine-tuning-based
editing methods while achieving comparable editing performance.
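The abstract describes Diff-QuickFix only at a high level: a data-free, closed-form update applied to the single causal self-attention layer of the CLIP text-encoder. As a rough illustration of what a closed-form edit of this kind can look like, the sketch below solves a regularized least-squares problem that remaps one projection matrix so that selected concept embeddings produce the outputs of an anchor concept. The objective, the `lam` regularizer, and the placeholder tensors are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of a closed-form concept edit on one projection matrix W
# (e.g. a key/value projection inside the causal self-attention layer of the
# text encoder). This is NOT the released Diff-QuickFix code: the regularized
# least-squares objective and all tensors below are illustrative assumptions.
import torch

def closed_form_edit(W, K_edit, V_target, lam=0.1):
    """Solve min_W' ||W' K_edit - V_target||^2 + lam * ||W' - W||^2 in closed form.

    W:        (d_out, d_in)  original projection weights
    K_edit:   (d_in, n)      embeddings of the concept tokens to edit/ablate
    V_target: (d_out, n)     desired outputs, e.g. W @ (anchor-concept embeddings)
    """
    d_in = W.shape[1]
    A = K_edit @ K_edit.T + lam * torch.eye(d_in)   # (d_in, d_in), invertible for lam > 0
    B = V_target @ K_edit.T + lam * W               # (d_out, d_in)
    return B @ torch.linalg.inv(A)

# Hypothetical usage with random stand-ins for CLIP token embeddings: map the
# projections of an unwanted concept (e.g. a style) onto those of a neutral anchor.
d_in = d_out = 768
n = 4
W_old = torch.randn(d_out, d_in)
concept_emb = torch.randn(d_in, n)   # stand-in for "concept" token embeddings
anchor_emb = torch.randn(d_in, n)    # stand-in for "anchor" token embeddings
W_new = closed_form_edit(W_old, concept_emb, W_old @ anchor_emb)
```

Because the update reduces to a handful of matrix multiplications and one small matrix inverse, a sub-second, data-free edit of the kind the abstract reports is plausible under this formulation.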
Related papers
- On Mechanistic Knowledge Localization in Text-to-Image Generative Models [44.208804082687294]
We introduce the concept of Mechanistic Localization in text-to-image models.
We measure the direct effect of intermediate layers on output generation by performing interventions in the cross-attention layers of the UNet (a toy sketch of this style of intervention appears after this list).
We employ LocoEdit, a fast closed-form editing method across popular open-source text-to-image models.
arXiv Detail & Related papers (2024-05-02T05:19:05Z) - Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA), which controls the visual attention maps using syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation, and especially in its attribute-object binding, on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z) - DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models [66.43179841884098]
We propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models.
Our method achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging.
arXiv Detail & Related papers (2023-07-05T16:43:56Z) - DiffUTE: Universal Text Editing Diffusion Model [32.384236053455]
We propose a universal self-supervised text editing diffusion model (DiffUTE).
It aims to replace or modify words in the source image with other words while maintaining a realistic appearance.
Our method achieves an impressive performance and enables controllable editing on in-the-wild images with high fidelity.
arXiv Detail & Related papers (2023-05-18T09:06:01Z) - PAIR-Diffusion: A Comprehensive Multimodal Object-Level Image Editor [135.17302411419834]
PAIR Diffusion is a generic framework that enables a diffusion model to control the structure and appearance of each object in the image.
We show that having control over the properties of each object in an image leads to comprehensive editing capabilities.
Our framework allows for various object-level editing operations on real images such as reference image-based appearance editing, free-form shape editing, adding objects, and variations.
arXiv Detail & Related papers (2023-03-30T17:13:56Z) - Editing Implicit Assumptions in Text-to-Image Diffusion Models [48.542005079915896]
Text-to-image diffusion models often make implicit assumptions about the world when generating images.
In this work, we aim to edit a given implicit assumption in a pre-trained diffusion model.
Our method is highly efficient, as it modifies a mere 2.2% of the model's parameters in under one second.
arXiv Detail & Related papers (2023-03-14T17:14:21Z) - PRedItOR: Text Guided Image Editing with Diffusion Prior [2.3022070933226217]
Text-guided image editing typically requires compute-intensive optimization of text embeddings or fine-tuning of the model weights.
Our architecture consists of a diffusion prior model that generates a CLIP image embedding conditioned on a text prompt, and a custom Latent Diffusion Model trained to generate images conditioned on the CLIP image embedding.
We combine this with structure preserving edits on the image decoder using existing approaches such as reverse DDIM to perform text guided image editing.
arXiv Detail & Related papers (2023-02-15T22:58:11Z) - ManiCLIP: Multi-Attribute Face Manipulation from Text [104.30600573306991]
We present a novel multi-attribute face manipulation method based on textual descriptions.
Our method generates natural manipulated faces with minimal text-irrelevant attribute editing.
arXiv Detail & Related papers (2022-10-02T07:22:55Z)
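Both the main paper (via its adaptation of Causal Mediation Analysis) and the Mechanistic Knowledge Localization entry above attribute generation behavior to individual components by intervening on their activations. The toy sketch below shows the basic mechanics on a small residual network rather than a real diffusion UNet: cache each block's contribution from a clean forward pass, splice it into a corrupted pass, and measure how much the output recovers. The `Block` module, the corruption scheme, and the recovery score are placeholders and are not taken from any of the papers' implementations.

```python
# Toy illustration of a layer-wise causal intervention ("restore one component's
# clean activation inside a corrupted run"). The residual MLP below stands in
# for a UNet / text-encoder; nothing here uses the papers' actual code.
import torch
import torch.nn as nn

torch.manual_seed(0)

class Block(nn.Module):
    """Tiny residual block standing in for an attention/MLP block."""
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, x):
        return x + self.ff(x)

d, n_layers = 16, 6
model = nn.Sequential(*[Block(d) for _ in range(n_layers)])

clean_x = torch.randn(1, d)
corrupt_x = clean_x + 0.5 * torch.randn(1, d)   # stand-in for a corrupted caption embedding

# Cache each block's clean contribution (the output of its `ff` sub-module).
clean_ff = []
hooks = [blk.ff.register_forward_hook(lambda m, inp, out: clean_ff.append(out.detach()))
         for blk in model]
clean_out = model(clean_x)
for h in hooks:
    h.remove()

corrupt_out = model(corrupt_x)
baseline_gap = (corrupt_out - clean_out).norm().item()

# Indirect-effect probe: restore one block's clean contribution inside the
# corrupted run and see how far the output moves back toward the clean output.
for idx, blk in enumerate(model):
    handle = blk.ff.register_forward_hook(lambda m, inp, out, p=clean_ff[idx]: p)
    restored_out = model(corrupt_x)
    handle.remove()
    recovery = baseline_gap - (restored_out - clean_out).norm().item()
    print(f"block {idx}: recovery = {recovery:.4f}")
```

In the papers themselves the interventions target specific cross-attention or self-attention layers of the diffusion pipeline and the effect is assessed on the generated images; the norm-based recovery score here is only a stand-in for that measurement.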