DALLE-2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models
- URL: http://arxiv.org/abs/2210.10606v1
- Date: Wed, 19 Oct 2022 14:52:40 GMT
- Title: DALLE-2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models
- Authors: Royi Rassin, Shauli Ravfogel, Yoav Goldberg
- Abstract summary: We show that DALLE-2 does not follow the constraint that each word has a single role in the interpretation.
We show that DALLE-2 depicts both senses of nouns with multiple senses at once.
- Score: 53.29993651680099
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the way DALLE-2 maps symbols (words) in the prompt to their
references (entities or properties of entities in the generated image). We show
that, in stark contrast to the way humans process language, DALLE-2 does not
follow the constraint that each word has a single role in the interpretation,
and sometimes re-uses the same symbol for different purposes. We collect a set
of stimuli that reflect the phenomenon: we show that DALLE-2 depicts both
senses of nouns with multiple senses at once; and that a given word can modify
the properties of two distinct entities in the image, or can be depicted as one
object and also modify the properties of another object, creating a semantic
leakage of properties between entities. Taken together, our study highlights
the differences between DALLE-2 and human language processing and opens an
avenue for future study on the inductive biases of text-to-image models.
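The abstract above describes probing the model with prompts that expose ambiguous word-to-concept mappings. Below is a minimal, hypothetical sketch of such a probe, assuming Stable Diffusion via Hugging Face's diffusers library as an openly available stand-in for DALLE-2; the prompts are illustrative examples, not the paper's actual stimuli.

```python
# Hypothetical probe (not the authors' code): generate images for prompts
# containing homonyms and inspect whether both senses appear at once.
# Stable Diffusion is used here as an accessible stand-in for DALLE-2.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Illustrative homonym prompts: a model that "sees double" may render
# both the animal and the sports equipment for the word "bat".
prompts = [
    "a bat is flying over a baseball stadium",
    "a seal is balancing a ball near the harbor",
]

for prompt in prompts:
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(prompt.replace(" ", "_") + ".png")
```

Under the paper's hypothesis, the image for the first prompt would sometimes depict both the animal and the baseball bat even though the word occurs only once in the prompt, illustrating the word-to-concept mapping flaw the authors describe.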
Related papers
- Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred to by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
- Building a visual semantics aware object hierarchy [0.0]
We propose a novel unsupervised method to build a visual semantics-aware object hierarchy.
Our intuition in this paper comes from real-world knowledge representation where concepts are hierarchically organized.
The evaluation consists of two parts: first, we apply the constructed hierarchy to the object recognition task, and then we compare our visual hierarchy with existing lexical hierarchies to show the validity of our method.
arXiv Detail & Related papers (2022-02-26T00:10:21Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
- Context-Aware Layout to Image Generation with Enhanced Object Appearance [123.62597976732948]
A layout to image (L2I) generation model aims to generate a complicated image containing multiple objects (things) against a natural background (stuff).
Existing L2I models have made great progress, but object-to-object and object-to-stuff relations are often broken.
We argue that these are caused by the lack of context-aware object and stuff feature encoding in their generators, and location-sensitive appearance representation in their discriminators.
arXiv Detail & Related papers (2021-03-22T14:43:25Z)
- L2C: Describing Visual Differences Needs Semantic Understanding of Individuals [65.87728481187625]
We introduce a Learning-to-Compare model, which learns to understand the semantic structures of two images and compare them while learning to describe each one.
We demonstrate that L2C benefits from a comparison between explicit semantic representations and single-image captions, and generalizes better on the new testing image pairs.
arXiv Detail & Related papers (2021-02-03T03:44:42Z)
- Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories.
In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture.
The proposed model aims to distill quality semantic-consistent representations that capture intrinsic features of seen images.
arXiv Detail & Related papers (2021-01-20T05:46:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.