I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create
Visual Metaphors
- URL: http://arxiv.org/abs/2305.14724v2
- Date: Fri, 14 Jul 2023 16:09:46 GMT
- Title: I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create
Visual Metaphors
- Authors: Tuhin Chakrabarty, Arkadiy Saakyan, Olivia Winn, Artemis Panagopoulou,
Yue Yang, Marianna Apidianaki, Smaranda Muresan
- Abstract summary: We propose a new task of generating visual metaphors from linguistic metaphors.
This is a challenging task for diffusion-based text-to-image models, since it requires the ability to model implicit meaning and compositionality.
We create a high-quality dataset containing 6,476 visual metaphors for 1,540 linguistic metaphors and their associated visual elaborations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual metaphors are powerful rhetorical devices used to persuade or
communicate creative ideas through images. Similar to linguistic metaphors,
they convey meaning implicitly through symbolism and juxtaposition of the
symbols. We propose a new task of generating visual metaphors from linguistic
metaphors. This is a challenging task for diffusion-based text-to-image models,
such as DALL·E 2, since it requires the ability to model implicit meaning
and compositionality. We propose to solve the task through the collaboration
between Large Language Models (LLMs) and Diffusion Models: Instruct GPT-3
(davinci-002) with Chain-of-Thought prompting generates text that represents a
visual elaboration of the linguistic metaphor containing the implicit meaning
and relevant objects, which is then used as input to the diffusion-based
text-to-image models. Using a human-AI collaboration framework, where humans
interact both with the LLM and the top-performing diffusion model, we create a
high-quality dataset containing 6,476 visual metaphors for 1,540 linguistic
metaphors and their associated visual elaborations. Evaluation by professional
illustrators shows the promise of LLM-Diffusion Model collaboration for this
task. To evaluate the utility of our human-AI collaboration framework and the
quality of our dataset, we perform both an intrinsic human-based evaluation and
an extrinsic evaluation using visual entailment as a downstream task.
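As a concrete illustration of the pipeline above, the sketch below chains the two models: an instruction-tuned GPT-3 completion produces the visual elaboration, which then becomes the prompt for a text-to-image model. This is a minimal sketch under stated assumptions, not the authors' implementation: the chain-of-thought template is invented for illustration, "text-davinci-002" is assumed as the davinci-002 instruct model named in the abstract, and Stable Diffusion (via Hugging Face diffusers) stands in for DALL·E 2, which is not available as a diffusers checkpoint.

```python
# Minimal sketch of the LLM -> diffusion pipeline described above.
# Assumptions: the prompt template is illustrative, "text-davinci-002"
# approximates "Instruct GPT-3 (davinci-002)", and Stable Diffusion
# stands in for DALL-E 2.
import openai  # openai<1.0 style API; reads OPENAI_API_KEY from the environment
import torch
from diffusers import StableDiffusionPipeline


def elaborate_metaphor(linguistic_metaphor: str) -> str:
    """Unpack a linguistic metaphor into a concrete, depictable scene."""
    prompt = (
        f"Metaphor: {linguistic_metaphor}\n"
        "Let's think step by step. What is the implicit meaning, and which "
        "objects would convey it visually?\n"
        "Visual elaboration:"
    )
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=100,
        temperature=0.7,
    )
    return response["choices"][0]["text"].strip()


# The visual elaboration, not the raw metaphor, is the image prompt.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # assumes a CUDA-capable GPU

elaboration = elaborate_metaphor("My bedroom is a pigsty.")
image = pipe(elaboration).images[0]
image.save("visual_metaphor.png")
```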
Related papers
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DiffusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- A framework for annotating and modelling intentions behind metaphor use [12.40493670580608]
We propose a novel taxonomy of intentions commonly attributed to metaphor, which comprises 9 categories.
We also release the first dataset annotated for intentions behind metaphor use.
We use this dataset to test the capability of large language models (LLMs) to infer the intentions behind metaphor use, in zero-shot and in-context few-shot settings.
arXiv Detail & Related papers (2024-07-04T14:13:57Z)
- Unveiling the Invisible: Captioning Videos with Metaphors [43.53477124719281]
In this work, we introduce a new Vision-Language (VL) task of describing the metaphors present in videos.
To facilitate this novel task, we construct and release a dataset with 705 videos and 2115 human-written captions.
We also propose a novel low-resource video metaphor captioning system: GIT-LLaVA, which obtains comparable performance to SoTA video language models on the proposed task.
arXiv Detail & Related papers (2024-06-07T12:32:44Z)
- In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z)
- IRFL: Image Recognition of Figurative Language [20.472997304393413]
Figurative forms are often conveyed through multiple modalities (e.g., both text and images).
We develop the Image Recognition of Figurative Language dataset.
We introduce two novel tasks as a benchmark for multimodal figurative language understanding.
arXiv Detail & Related papers (2023-03-27T17:59:55Z)
- MetaCLUE: Towards Comprehensive Visual Metaphors Research [43.604408485890275]
We introduce MetaCLUE, a set of vision tasks on visual metaphor.
We perform a comprehensive analysis of state-of-the-art models in vision and language based on our annotations.
We hope this work provides a concrete step towards developing AI systems with human-like creative capabilities.
arXiv Detail & Related papers (2022-12-19T22:41:46Z)
- Metaphor Generation with Conceptual Mappings [58.61307123799594]
We aim to generate a metaphoric sentence given a literal expression by replacing relevant verbs.
We propose to control the generation process by encoding conceptual mappings between cognitive domains.
We show that the unsupervised CM-Lex model is competitive with recent deep learning metaphor generation systems.
arXiv Detail & Related papers (2021-06-02T15:27:05Z)
- MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding [22.756157298168127]
Based on a theoretically-grounded connection between metaphors and symbols, we propose a method to automatically construct a parallel corpus.
For the generation task, we incorporate a metaphor discriminator to guide the decoding of a sequence to sequence model fine-tuned on our parallel data.
A task-based evaluation shows that human-written poems enhanced with metaphors are preferred 68% of the time compared to poems without metaphors.
arXiv Detail & Related papers (2021-03-11T16:39:19Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)