IRFL: Image Recognition of Figurative Language
- URL: http://arxiv.org/abs/2303.15445v3
- Date: Sat, 25 Nov 2023 22:07:55 GMT
- Title: IRFL: Image Recognition of Figurative Language
- Authors: Ron Yosef, Yonatan Bitton, Dafna Shahaf
- Abstract summary: Figurative forms are often conveyed through multiple modalities (e.g., both text and images).
We develop the Image Recognition of Figurative Language dataset.
We introduce two novel tasks as a benchmark for multimodal figurative language understanding.
- Score: 20.472997304393413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Figures of speech such as metaphors, similes, and idioms are integral parts
of human communication. They are ubiquitous in many forms of discourse,
allowing people to convey complex, abstract ideas and evoke emotion. As
figurative forms are often conveyed through multiple modalities (e.g., both
text and images), understanding multimodal figurative language is an important
AI challenge, weaving together profound vision, language, commonsense and
cultural knowledge. In this work, we develop the Image Recognition of
Figurative Language (IRFL) dataset. We leverage human annotation and an
automatic pipeline we created to generate a multimodal dataset, and introduce
two novel tasks as a benchmark for multimodal figurative language
understanding. We experimented with state-of-the-art vision and language models
and found that the best (22%) performed substantially worse than humans (97%).
We release our dataset, benchmark, and code, in hopes of driving the
development of models that can better understand figurative language.
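To make the evaluation setting concrete, below is a minimal, hedged sketch of the kind of zero-shot phrase-to-image matching a vision-and-language model such as CLIP could be run through; the checkpoint name, the example idiom, and the image paths are illustrative placeholders, not the paper's actual tasks or data.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical zero-shot baseline: score a figurative phrase against candidate
# images and pick the best match. Checkpoint, idiom, and file names are
# placeholders for illustration only.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

idiom = "spill the beans"
candidate_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg", "img_3.jpg"]
images = [Image.open(p) for p in candidate_paths]

inputs = processor(text=[idiom], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_text   # shape: (1, num_images)
probs = logits.softmax(dim=-1)

print("predicted image:", candidate_paths[probs.argmax(dim=-1).item()])
```

A plausible failure mode for such models is latching onto the literal reading of the phrase (beans actually being spilled) rather than its figurative meaning, which would be consistent with the large model-human gap reported above.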
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! [14.84123301554462]
We present UNPIE, a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities.
Our dataset includes 1,000 puns, each accompanied by an image that explains both meanings.
The results indicate that various Socratic Models and Visual-Language Models improve over the text-only models when given visual context.
arXiv Detail & Related papers (2024-10-01T19:32:57Z)
- Multilingual Multi-Figurative Language Detection [14.799109368073548]
Figurative language understanding is highly understudied in a multilingual setting.
We introduce multilingual multi-figurative language modelling, and provide a benchmark for sentence-level figurative language detection.
We develop a framework for figurative language detection based on template-based prompt learning.
arXiv Detail & Related papers (2023-05-31T18:52:41Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
Models perform substantially worse on all languages than on English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors [38.70166865926743]
We propose a new task of generating visual metaphors from linguistic metaphors.
This is a challenging task for diffusion-based text-to-image models, since it requires the ability to model implicit meaning and compositionality.
We create a high-quality dataset containing 6,476 visual metaphors for 1,540 linguistic metaphors and their associated visual elaborations.
arXiv Detail & Related papers (2023-05-24T05:01:10Z)
- Language Is Not All You Need: Aligning Perception with Language Models [110.51362453720458]
We introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions.
We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data.
Experimental results show that Kosmos-1 achieves impressive performance on language understanding, generation, and even OCR-free NLP.
We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal settings and from multimodal settings to language.
arXiv Detail & Related papers (2023-02-27T18:55:27Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and a convolutional neural network, respectively (a minimal sketch of this encoding step appears after this list).
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Multi-Figurative Language Generation [14.13782709351219]
Figurative language generation is the task of reformulating a given text in the desired figure of speech while still being faithful to the original context.
We take the first step towards multi-figurative language modelling by providing a benchmark for the automatic generation of five common figurative forms in English.
arXiv Detail & Related papers (2022-09-05T08:48:09Z)
- Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text [70.14613727284741]
Communicating with humans is challenging for AIs because it requires a shared understanding of the world, complex semantics, and at times multi-modal gestures.
We investigate these challenges in the context of Iconary, a collaborative game of drawing and guessing based on Pictionary.
We propose models to play Iconary and train them on over 55,000 games between human players.
arXiv Detail & Related papers (2021-12-01T19:41:03Z)
- Investigating Robustness of Dialog Models to Popular Figurative Language Constructs [30.841109045790862]
We analyze the performance of existing dialog models in situations where the input dialog context exhibits use of figurative language.
We propose lightweight solutions to help existing models become more robust to figurative language.
arXiv Detail & Related papers (2021-10-01T23:55:16Z)
- It's not Rocket Science: Interpreting Figurative Language in Narratives [48.84507467131819]
We study the interpretation of two types of non-compositional figurative language: idioms and similes.
Our experiments show that models based solely on pre-trained language models perform substantially worse than humans on these tasks.
We additionally propose knowledge-enhanced models, adopting human strategies for interpreting figurative language.
arXiv Detail & Related papers (2021-08-31T21:46:35Z)
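As a side illustration of the dual-encoder step summarized in the "Universal Multimodal Representation for Language Understanding" entry above, the sketch below encodes a sentence with a Transformer and its retrieved images with a CNN; the BERT and ResNet checkpoints, the dummy images, and the concatenation-based fusion are assumptions made here for illustration, not the paper's actual architecture.

```python
import torch
from PIL import Image
from torchvision import models, transforms
from transformers import AutoModel, AutoTokenizer

# Hypothetical dual-encoder sketch: a Transformer encodes the sentence, a CNN
# encodes the retrieved images. Model choices and the fusion are assumptions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()  # keep the pooled 2048-d visual features

preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

sentence = "He finally spilled the beans."                     # placeholder sentence
retrieved = [Image.new("RGB", (224, 224)) for _ in range(3)]   # stand-ins for looked-up images

tokens = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    text_vec = text_encoder(**tokens).last_hidden_state[:, 0]            # [CLS] vector
    image_vecs = cnn(torch.stack([preprocess(im) for im in retrieved]))  # (3, 2048)

# One simple way to combine the modalities: concatenate the sentence vector
# with the mean image vector before a downstream NLP head.
fused = torch.cat([text_vec, image_vecs.mean(dim=0, keepdim=True)], dim=-1)
print(fused.shape)  # torch.Size([1, 2816])
```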
This list is automatically generated from the titles and abstracts of the papers on this site.