What the DAAM: Interpreting Stable Diffusion Using Cross Attention
- URL: http://arxiv.org/abs/2210.04885v3
- Date: Thu, 13 Oct 2022 02:00:54 GMT
- Title: What the DAAM: Interpreting Stable Diffusion Using Cross Attention
- Authors: Raphael Tang, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar,
Jimmy Lin, Ferhan Ture
- Abstract summary: Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation.
They remain poorly understood, lacking explainability and interpretability analyses, largely due to their proprietary, closed-source nature.
We propose DAAM, a novel method based on upscaling and aggregating cross-attention activations in the latent denoising subnetwork.
We show that DAAM performs strongly on caption-generated images, achieving an mIoU of 61.0, and it outperforms supervised models on open-vocabulary segmentation.
- Score: 39.97805685586423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale diffusion neural networks represent a substantial milestone in
text-to-image generation, with some performing comparably to real photographs in
human evaluation. However, they remain poorly understood, lacking
explainability and interpretability analyses, largely due to their proprietary,
closed-source nature. In this paper, to shine some much-needed light on
text-to-image diffusion models, we perform a text-image attribution analysis on
Stable Diffusion, a recently open-sourced large diffusion model. To produce
pixel-level attribution maps, we propose DAAM, a novel method based on
upscaling and aggregating cross-attention activations in the latent denoising
subnetwork. We support its correctness by evaluating its unsupervised semantic
segmentation quality on its own generated imagery, compared to supervised
segmentation models. We show that DAAM performs strongly on COCO
caption-generated images, achieving an mIoU of 61.0, and it outperforms
supervised models on open-vocabulary segmentation, for an mIoU of 51.5. We
further find that certain parts of speech, like punctuation and conjunctions,
influence the generated imagery most, which agrees with the prior literature,
while determiners and numerals influence it the least, suggesting poor numeracy. To our
knowledge, we are the first to propose and study word-pixel attribution for
interpreting large-scale diffusion models. Our code and data are at
https://github.com/castorini/daam.
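As a rough illustration of the "upscaling and aggregating cross-attention activations" idea described in the abstract, here is a minimal sketch; it is not the authors' released implementation, and the tensor shapes, head-averaging step, and mean-threshold mask are illustrative assumptions:
```python
import torch
import torch.nn.functional as F

def aggregate_heat_maps(cross_attn_maps, out_size=512):
    """Aggregate cross-attention maps into per-token heat maps.

    cross_attn_maps: list of tensors, one per (timestep, layer) pair, each of
        shape (heads, h*w, num_tokens); the spatial size h*w varies by layer.
    Returns a tensor of shape (num_tokens, out_size, out_size).
    """
    accumulated = None
    for attn in cross_attn_maps:
        heads, hw, num_tokens = attn.shape
        side = round(hw ** 0.5)  # latent feature maps are square
        # average over attention heads, then reshape to (tokens, 1, h, w)
        maps = attn.mean(dim=0).permute(1, 0).reshape(num_tokens, 1, side, side)
        # bilinearly upscale each token's map to the output resolution
        maps = F.interpolate(maps, size=(out_size, out_size),
                             mode='bilinear', align_corners=False).squeeze(1)
        # sum across layers and timesteps
        accumulated = maps if accumulated is None else accumulated + maps
    return accumulated

# Toy usage: two layers at 16x16 and 32x32 latent resolution, 8 heads, 10 prompt tokens.
maps = [torch.rand(8, 16 * 16, 10), torch.rand(8, 32 * 32, 10)]
heat = aggregate_heat_maps(maps, out_size=64)
# A rough segmentation mask for one token: threshold its heat map at its mean.
mask = heat[3] > heat[3].mean()
print(heat.shape, mask.float().mean().item())
```
Heat maps of this kind, thresholded into binary masks, are what the paper evaluates against supervised segmentation models; the full implementation is in the repository linked above.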
Related papers
- Enhancing Label-efficient Medical Image Segmentation with Text-guided Diffusion Models [5.865983529245793]
TextDiff improves semantic representation through inexpensive medical text annotations.
We show that TextDiff is significantly superior to the state-of-the-art multi-modal segmentation methods with only a few training samples.
arXiv Detail & Related papers (2024-07-07T10:21:08Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experiment results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z)
- Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that pre-trained diffusion models can be adapted faster to downstream visual perception tasks with the proposed VPD framework.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
- Semantic-Conditional Diffusion Networks for Image Captioning [116.86677915812508]
We propose a new diffusion-model-based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net).
In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visual-language alignment and linguistic coherence.
Experiments on COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task.
arXiv Detail & Related papers (2022-12-06T16:08:16Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.