Schr\"{o}dinger's Bat: Diffusion Models Sometimes Generate Polysemous
Words in Superposition
- URL: http://arxiv.org/abs/2211.13095v1
- Date: Wed, 23 Nov 2022 16:26:49 GMT
- Title: Schr\"{o}dinger's Bat: Diffusion Models Sometimes Generate Polysemous
Words in Superposition
- Authors: Jennifer C. White, Ryan Cotterell
- Abstract summary: Recent work has shown that text-to-image diffusion models can display strange behaviours when a prompt contains a word with multiple possible meanings.
We show that when given an input that is the sum of encodings of two distinct words, the model can produce an image containing both concepts represented in the sum.
We then demonstrate that the CLIP encoder used to encode prompts encodes polysemous words as a superposition of meanings, and that using linear algebraic techniques we can edit these representations to influence the senses represented in the generated images.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has shown that despite their impressive capabilities,
text-to-image diffusion models such as DALL-E 2 (Ramesh et al., 2022) can
display strange behaviours when a prompt contains a word with multiple possible
meanings, often generating images containing both senses of the word (Rassin et
al., 2022). In this work we seek to put forward a possible explanation of this
phenomenon. Using the similar Stable Diffusion model (Rombach et al., 2022), we
first show that when given an input that is the sum of encodings of two
distinct words, the model can produce an image containing both concepts
represented in the sum. We then demonstrate that the CLIP encoder used to
encode prompts (Radford et al., 2021) encodes polysemous words as a
superposition of meanings, and that using linear algebraic techniques we can
edit these representations to influence the senses represented in the generated
images. Combining these two findings, we suggest that the homonym duplication
phenomenon described by Rassin et al. (2022) is caused by diffusion models
producing images representing both of the meanings that are present in
superposition in the encoding of a polysemous word.
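
To make the two findings concrete, below is a minimal sketch (not the authors' released code) that sums the CLIP encodings of two words, decodes the sum with Stable Diffusion, and then applies a difference-vector edit in the spirit of the paper's linear-algebraic sense editing. It assumes the Hugging Face diffusers/transformers APIs; the checkpoint, word pairs, prompts and edit strength alpha are illustrative choices of ours.

```python
# Minimal sketch (not the authors' code) of the two findings, assuming the
# Hugging Face diffusers/transformers stack. The checkpoint, word pairs and
# edit strength `alpha` are illustrative choices, not the paper's settings.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

@torch.no_grad()
def encode(text: str) -> torch.Tensor:
    """CLIP text-encoder hidden states (1, 77, 768) for a prompt."""
    ids = pipe.tokenizer(
        text,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to(pipe.device)
    return pipe.text_encoder(ids)[0]

# Finding 1: the sum of the encodings of two distinct words can produce an
# image containing both concepts.
summed = encode("a lion") + encode("a zebra")
pipe(prompt_embeds=summed).images[0].save("sum.png")

# Finding 2 (in spirit): a linear edit that pushes the polysemous word
# "bat" toward its animal sense and away from its baseball sense.
bat = encode("a bat")
direction = encode("a bat flying at night") - encode("a wooden baseball bat")
alpha = 1.0  # illustrative edit strength
pipe(prompt_embeds=bat + alpha * direction).images[0].save("edited.png")
```

Note that summing two encodings roughly doubles their norm; whether and how the paper rescales the sum is a detail this sketch glosses over.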
Related papers
- Word-Level Explanations for Analyzing Bias in Text-to-Image Models
Text-to-image (T2I) models can generate images that underrepresent minority groups defined by race and sex.
This paper investigates which word in the input prompt is responsible for bias in generated images.
arXiv Detail & Related papers (2023-06-03T21:39:07Z)
- DALLE-2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models
We show that DALLE-2 does not follow the constraint that each word has a single role in the interpretation.
We show that DALLE-2 depicts both senses of nouns with multiple senses at once.
arXiv Detail & Related papers (2022-10-19T14:52:40Z)
- Adversarial Attacks on Image Generation With Made-Up Words
A text-guided image generation model can be prompted to generate images using nonce words adversarially designed to evoke specific visual concepts.
The implications of these techniques for the circumvention of existing approaches to content moderation are discussed.
arXiv Detail & Related papers (2022-08-04T15:10:23Z)
- Compositional Visual Generation with Composable Diffusion Models
We propose an alternative structured approach for compositional generation using diffusion models.
An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image.
The proposed method can generate scenes at test time that are substantially more complex than those seen in training.
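
A hedged sketch of the composition at the level of one denoising step, assuming a diffusers-style UNet interface (the function name and the weight w are ours, not the paper's code): each concept's noise prediction is added relative to the unconditional prediction, giving a conjunction of the components.

```python
# Sketch of conjunction-style composition of diffusion models: each
# concept's noise prediction is combined relative to the unconditional
# prediction at every denoising step. Names and the weight `w` are ours.
import torch

@torch.no_grad()
def composed_noise(unet, latents, t, uncond_emb, concept_embs, w=7.5):
    eps_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample
    eps = eps_uncond
    for emb in concept_embs:
        eps_c = unet(latents, t, encoder_hidden_states=emb).sample
        eps = eps + w * (eps_c - eps_uncond)  # add each concept's direction
    return eps  # use in place of the usual guided prediction in the sampler
```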
arXiv Detail & Related papers (2022-06-03T17:47:04Z)
- Latent Topology Induction for Understanding Contextualized Representations
We study the representation space of contextualized embeddings and gain insight into the hidden topology of large language models.
We show there exists a network of latent states that summarize linguistic properties of contextualized representations.
arXiv Detail & Related papers (2022-06-03T11:22:48Z)
- Hierarchical Text-Conditional Image Generation with CLIP Latents
We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style.
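
The two-stage design behind these claims can be sketched structurally; the callables below are hypothetical stand-ins for the CLIP text encoder, the prior, and the diffusion decoder, since only the data flow is the point here.

```python
# Structural sketch of unCLIP-style generation. Component interfaces are
# hypothetical stand-ins; only the two-stage data flow is illustrated.
from typing import Callable, List
import torch

def generate(caption: str,
             encode_text: Callable[[str], torch.Tensor],
             prior: Callable[[torch.Tensor], torch.Tensor],
             decoder: Callable[[torch.Tensor], torch.Tensor]) -> torch.Tensor:
    text_emb = encode_text(caption)  # CLIP text embedding
    image_emb = prior(text_emb)      # explicitly generated image representation
    return decoder(image_emb)        # diffusion decoder -> pixels

def variations(image_emb: torch.Tensor,
               decoder: Callable[[torch.Tensor], torch.Tensor],
               n: int = 4) -> List[torch.Tensor]:
    # Re-decoding the same image embedding yields variations that keep the
    # original's semantics and style while varying non-essential details.
    return [decoder(image_emb) for _ in range(n)]
```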
arXiv Detail & Related papers (2022-04-13T01:10:33Z)
- Image Retrieval from Contextual Descriptions
We introduce Image Retrieval from Contextual Descriptions (ImageCoDe), in which models must retrieve the correct image from a set of 10 minimally contrastive candidates based on a contextual description.
The best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 for humans.
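
As a hedged illustration of the task interface, here is a zero-shot CLIP baseline of ours (not necessarily the paper's best-performing variant) that scores the 10 candidates against the description and returns the best match.

```python
# Zero-shot retrieval baseline for an ImageCoDe-style instance: score each
# candidate image against the contextual description with an off-the-shelf
# CLIP and pick the argmax. This baseline is our illustrative choice.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve(description: str, candidate_paths: list[str]) -> int:
    images = [Image.open(p) for p in candidate_paths]
    inputs = processor(text=[description], images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_text  # shape (1, num_candidates)
    return int(sims.argmax())  # index of the predicted image
```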
arXiv Detail & Related papers (2022-03-29T19:18:12Z)
- Diffusion Autoencoders: Toward a Meaningful and Decodable Representation
Diffusion probabilistic models (DPMs) have achieved remarkable image generation quality that rivals that of GANs.
Unlike GANs, DPMs use a set of latent variables that lack semantic meaning and cannot serve as a useful representation for other tasks.
This paper explores the possibility of using DPMs for representation learning and seeks to extract a meaningful and decodable representation of an input image via autoencoding.
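
The autoencoding idea reduces to a round-trip through two latents; the interfaces below are hypothetical stand-ins for the paper's semantic encoder and conditional DDIM, shown only to make the representation (z_sem, x_T) explicit.

```python
# Structural sketch of a diffusion autoencoder: a semantic latent z_sem
# from a learned encoder, plus a deterministic DDIM noise code x_T, are
# together enough to reconstruct the input. Interfaces are hypothetical.
from typing import Callable, Tuple
import torch

def autoencode(x: torch.Tensor,
               semantic_encoder: Callable[[torch.Tensor], torch.Tensor],
               ddim_invert: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
               ddim_decode: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
               ) -> Tuple[torch.Tensor, torch.Tensor]:
    z_sem = semantic_encoder(x)      # high-level, semantically meaningful
    x_T = ddim_invert(x, z_sem)      # stochastic detail, captured deterministically
    x_rec = ddim_decode(x_T, z_sem)  # near-exact reconstruction
    return z_sem, x_rec
```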
arXiv Detail & Related papers (2021-11-30T18:24:04Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
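
One way to see the repurposing: a simplified greedy sketch (ours; the actual method optimizes the language model's context with CLIP gradients rather than re-ranking tokens) in which GPT-2 proposes candidate next tokens and CLIP scores how well each continuation matches the image.

```python
# Simplified sketch of CLIP-guided captioning: GPT-2 proposes next tokens,
# CLIP re-ranks them by similarity to the image. The real method instead
# updates the LM's context with CLIP gradients; this greedy variant and
# the mixing weight `lam` are ours, for illustration only.
import torch
from PIL import Image
from transformers import (CLIPModel, CLIPProcessor,
                          GPT2LMHeadModel, GPT2Tokenizer)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm_tok = GPT2Tokenizer.from_pretrained("gpt2")

@torch.no_grad()
def caption(image: Image.Image, prefix: str = "A photo of",
            steps: int = 8, k: int = 20, lam: float = 1.0) -> str:
    text = prefix
    for _ in range(steps):
        ids = lm_tok(text, return_tensors="pt").input_ids
        logits = lm(ids).logits[0, -1]          # next-token scores
        top = logits.topk(k).indices
        candidates = [text + lm_tok.decode([int(t)]) for t in top]
        inputs = clip_proc(text=candidates, images=image,
                           return_tensors="pt", padding=True, truncation=True)
        sims = clip(**inputs).logits_per_image[0]  # image-text match, (k,)
        scores = logits[top] + lam * sims          # fluency + visual match
        text = candidates[int(scores.argmax())]
    return text
```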
arXiv Detail & Related papers (2021-11-29T11:01:49Z)