Discovering the Hidden Vocabulary of DALLE-2
- URL: http://arxiv.org/abs/2206.00169v1
- Date: Wed, 1 Jun 2022 01:14:48 GMT
- Title: Discovering the Hidden Vocabulary of DALLE-2
- Authors: Giannis Daras and Alexandros G. Dimakis
- Abstract summary: We find that DALLE-2 seems to have a hidden vocabulary that can be used to generate images with absurd prompts.
For example, it seems that "Apoploe vesrreaitais" means birds and "Contarra ccetnxniams luryca tanniounons" (sometimes) means bugs or pests.
- Score: 96.19666636109729
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We discover that DALLE-2 seems to have a hidden vocabulary that can be used
to generate images with absurd prompts. For example, it seems that
\texttt{Apoploe vesrreaitais} means birds and \texttt{Contarra ccetnxniams
luryca tanniounons} (sometimes) means bugs or pests. We find that these prompts
are often consistent in isolation but also sometimes in combinations. We
present our black-box method to discover words that seem random but have some
correspondence to visual concepts. This creates important security and
interpretability challenges.
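As a rough illustration of the kind of black-box check the abstract describes, the sketch below scores how consistently a candidate "hidden word" produces images of a target concept. The generator and the concept scorer are passed in as callables because no specific text-to-image API or classifier is given here; the function names, sample count, and threshold are illustrative assumptions, not the authors' exact procedure.

```python
from typing import Any, Callable, List

def prompt_consistency(
    candidate_prompt: str,
    concept: str,
    generate_images: Callable[[str, int], List[Any]],  # black-box text-to-image call (assumed interface)
    concept_score: Callable[[Any, str], float],         # e.g. a CLIP-style zero-shot score (assumed interface)
    n_samples: int = 8,
    threshold: float = 0.5,
) -> float:
    """Fraction of images generated from `candidate_prompt` that the scorer assigns to `concept`."""
    images = generate_images(candidate_prompt, n_samples)
    hits = sum(concept_score(image, concept) > threshold for image in images)
    return hits / n_samples
```

A high value for, say, prompt_consistency("Apoploe vesrreaitais", "bird", ...) would indicate that the gibberish string has a stable visual meaning to the model, which is the behaviour the paper reports.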
Related papers
- When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding [72.15848305976706]
Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning.
When confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content.
We propose a training-free semantic hallucination mitigation framework comprising two key components.
arXiv Detail & Related papers (2025-06-05T19:53:19Z) - SlangDIT: Benchmarking LLMs in Interpretative Slang Translation [89.48208612476068]
This paper introduces the interpretative slang translation task (named SlangDIT).
It consists of three sub-tasks: slang detection, cross-lingual slang explanation, and slang translation within the current context.
Based on the benchmark, we propose a deep thinking model, named SlangOWL. It first identifies whether the sentence contains slang, then judges whether the slang is polysemous and analyzes its possible meanings.
arXiv Detail & Related papers (2025-05-20T10:37:34Z) - Text-to-Image Generation for Vocabulary Learning Using the Keyword Method [9.862827991755076]
The 'keyword method' is an effective technique for learning the vocabulary of a foreign language.
It involves creating a memorable visual link between what a word means and what its pronunciation in a foreign language sounds like.
We developed an application that combines the keyword method with text-to-image generators to externalise the memorable visual links into visuals.
arXiv Detail & Related papers (2025-01-28T17:39:50Z) - Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models [51.50892380172863]
We show that most state-of-the-art MLLMs suffer from severe verb hallucination.
We propose a novel rich verb knowledge-based tuning method to mitigate verb hallucination.
arXiv Detail & Related papers (2024-12-06T10:53:47Z) - From cart to truck: meaning shift through words in English in the last two centuries [0.0]
This onomasiological study uses diachronic word embeddings to explore how different words represented the same concepts over time.
We identify shifts in energy, transport, entertainment, and computing domains, revealing connections between language and societal changes.
arXiv Detail & Related papers (2024-08-29T02:05:39Z) - Kiki or Bouba? Sound Symbolism in Vision-and-Language Models [13.300199242824934]
We show that sound symbolism is reflected in vision-and-language models such as CLIP and Stable Diffusion.
Our work provides a novel method for demonstrating sound symbolism and understanding its nature using computational tools.
arXiv Detail & Related papers (2023-10-25T17:15:55Z) - Discovering Failure Modes of Text-guided Diffusion Models via Adversarial Search [52.519433040005126]
Text-guided diffusion models (TDMs) are widely applied but can fail unexpectedly.
In this work, we aim to study and understand the failure modes of TDMs in more detail.
We propose SAGE, the first adversarial search method on TDMs.
arXiv Detail & Related papers (2023-06-01T17:59:00Z) - Schrödinger's Bat: Diffusion Models Sometimes Generate Polysemous Words in Superposition [71.45263447328374]
Recent work has shown that text-to-image diffusion models can display strange behaviours when a prompt contains a word with multiple possible meanings.
We show that when given an input that is the sum of encodings of two distinct words, the model can produce an image containing both concepts represented in the sum.
We then demonstrate that the CLIP encoder used to encode prompts represents polysemous words as a superposition of meanings, and that linear algebraic techniques can edit these representations to influence the senses represented in the generated images (see the first sketch after this list).
arXiv Detail & Related papers (2022-11-23T16:26:49Z) - Detecting Euphemisms with Literal Descriptions and Visual Imagery [18.510509701709054]
This paper describes our two-stage system for the Euphemism Detection shared task hosted by the 3rd Workshop on Figurative Language Processing in conjunction with EMNLP 2022.
In the first stage, we seek to mitigate the ambiguity of potentially euphemistic terms by incorporating literal descriptions into the input text prompts to our baseline model. This kind of direct supervision yields a remarkable performance improvement.
In the second stage, we integrate visual supervision into our system using visual imagery: two sets of images generated by a text-to-image model from the terms and their descriptions. Our experiments demonstrate that visual supervision also gives a statistically significant performance boost.
arXiv Detail & Related papers (2022-11-08T21:50:05Z) - Visual Keyword Spotting with Attention [82.79015266453533]
We investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword.
We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods.
We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
arXiv Detail & Related papers (2021-10-29T17:59:04Z) - Euphemistic Phrase Detection by Masked Language Model [9.49544185939481]
We perform phrase mining on a social media corpus to extract quality phrases.
Then, we utilize word embedding similarities to select a set of euphemistic phrase candidates.
We report 20-50% higher detection accuracies using our algorithm for detecting euphemistic phrases.
arXiv Detail & Related papers (2021-09-10T04:57:30Z) - Towards Dark Jargon Interpretation in Underground Forums [37.15748678894555]
We present a novel method towards automatically identifying and interpreting dark jargons.
We formalize the problem as a mapping from dark words to "clean" words with no hidden meaning.
Our method makes use of interpretable representations of dark and clean words in the form of probability distributions over a shared vocabulary.
arXiv Detail & Related papers (2020-11-05T18:08:32Z) - Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of the meanings it can take (see the second sketch after this list).
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z) - On Vocabulary Reliance in Scene Text Recognition [79.21737876442253]
Scene text recognition methods perform well on images whose words are within the vocabulary but generalize poorly to images with words outside it.
We call this phenomenon "vocabulary reliance".
We propose a simple yet effective mutual learning strategy to allow models of two families to learn collaboratively.
arXiv Detail & Related papers (2020-05-08T11:16:58Z) - Decomposing Word Embedding with the Capsule Network [23.294890047230584]
We propose a capsule network-based method, CapsDecE2S, to decompose the unsupervised word embedding of an ambiguous word into context-specific sense embeddings.
With attention operations, CapsDecE2S integrates the word context to reconstruct multiple morpheme-like vectors into the context-specific sense embedding.
In this method, sense learning is converted into a binary classification that explicitly learns the relation between senses through matching and non-matching labels.
arXiv Detail & Related papers (2020-04-07T06:37:27Z)
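First sketch (referenced from the Schrödinger's Bat entry above): a minimal example of the kind of linear algebraic sense editing that entry mentions, assuming access to text embeddings from a CLIP-style encoder. Defining a sense axis from two manually disambiguated prompts is an illustrative assumption, not that paper's exact procedure.

```python
import numpy as np

def edit_out_sense(word_emb: np.ndarray, sense_a: np.ndarray, sense_b: np.ndarray) -> np.ndarray:
    """Shift an ambiguous word's embedding away from the sense axis defined by two
    disambiguated prompts (e.g. "a bat, the animal" vs. "a baseball bat")."""
    axis = sense_a - sense_b
    axis = axis / np.linalg.norm(axis)
    # Remove the component of the embedding that points toward sense_a relative to sense_b.
    return word_emb - np.dot(word_emb, axis) * axis
```

The edited embedding would then be used to condition generation in place of the original prompt embedding, shifting which sense of the word appears in the images.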
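Second sketch (referenced from the "Speakers Fill Lexical Semantic Gaps with Context" entry above): computing the entropy of a word's empirical sense distribution, the quantity that entry uses to operationalise lexical ambiguity. The sense labels and counts below are made-up examples.

```python
import math
from collections import Counter
from typing import Hashable, Iterable

def meaning_entropy(sense_labels: Iterable[Hashable]) -> float:
    """Entropy (in bits) of the empirical distribution over observed senses of a word."""
    counts = Counter(sense_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# e.g. "bank" observed 7 times in a financial sense and 3 times in a river sense:
print(meaning_entropy(["FINANCE"] * 7 + ["RIVER"] * 3))  # ~0.881 bits
```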
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences arising from its use.