Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text
- URL: http://arxiv.org/abs/2112.00800v1
- Date: Wed, 1 Dec 2021 19:41:03 GMT
- Title: Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text
- Authors: Christopher Clark, Jordi Salvador, Dustin Schwenk, Derrick Bonafilia,
Mark Yatskar, Eric Kolve, Alvaro Herrasti, Jonghyun Choi, Sachin Mehta, Sam
Skjonsberg, Carissa Schoenick, Aaron Sarnat, Hannaneh Hajishirzi, Aniruddha
Kembhavi, Oren Etzioni, Ali Farhadi
- Abstract summary: Communicating with humans is challenging for AIs because it requires a shared understanding of the world, complex semantics, and at times multi-modal gestures.
We investigate these challenges in the context of Iconary, a collaborative game of drawing and guessing based on Pictionary.
We propose models to play Iconary and train them on over 55,000 games between human players.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Communicating with humans is challenging for AIs because it requires a shared
understanding of the world, complex semantics (e.g., metaphors or analogies),
and at times multi-modal gestures (e.g., pointing with a finger, or an arrow in
a diagram). We investigate these challenges in the context of Iconary, a
collaborative game of drawing and guessing based on Pictionary, which poses a
novel challenge for the research community. In Iconary, a Guesser tries to
identify a phrase that a Drawer is drawing by composing icons, and the Drawer
iteratively revises the drawing in response to help the Guesser. This
back-and-forth often uses canonical scenes, visual metaphor, or icon
compositions to express challenging words, making it an ideal test for mixing
language and visual/symbolic communication in AI. We propose models to play
Iconary and train them on over 55,000 games between human players. Our models
are skillful players and are able to employ world knowledge in language models
to play with words unseen during training. Elite human players outperform our
models, particularly at the drawing task, leaving an important gap for future
research to address. We release our dataset, code, and evaluation setup as a
challenge to the community at http://www.github.com/allenai/iconary.
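The interaction loop described above (a Drawer composing icons, a Guesser proposing phrases, and iterative revision in between) can be made concrete with a short sketch. The Python below is a hypothetical illustration of that protocol only; the class and method names are assumptions and do not reflect the released allenai/iconary code:

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Minimal sketch of the Iconary turn loop described in the abstract.
# All names here are hypothetical illustrations, not the released
# allenai/iconary API.

@dataclass
class Icon:
    name: str            # icon identifier, e.g. "person" or "arrow"
    x: float             # canvas position
    y: float
    scale: float = 1.0

@dataclass
class Drawing:
    icons: list[Icon] = field(default_factory=list)

class Drawer:
    """Knows the target phrase and revises the drawing after each guess."""

    def __init__(self, phrase: str):
        self.phrase = phrase

    def revise(self, drawing: Drawing, last_guess: str | None) -> Drawing:
        # A trained Drawer would compose icons conditioned on the phrase
        # and on what the Guesser got wrong; this stub adds a placeholder.
        drawing.icons.append(Icon(name="placeholder", x=0.5, y=0.5))
        return drawing

class Guesser:
    """Sees only the drawing (never the phrase) and produces guesses."""

    def guess(self, drawing: Drawing, history: list[str]) -> str:
        # A trained Guesser would decode a phrase from the icon composition.
        return "stub guess"

def play(drawer: Drawer, guesser: Guesser, max_rounds: int = 5) -> bool:
    """Alternate drawing revisions and guesses until the phrase is found."""
    drawing, history = Drawing(), []
    last_guess: str | None = None
    for _ in range(max_rounds):
        drawing = drawer.revise(drawing, last_guess)
        last_guess = guesser.guess(drawing, history)
        history.append(last_guess)
        if last_guess.lower() == drawer.phrase.lower():
            return True   # phrase identified: the pair wins
    return False          # round budget exhausted

if __name__ == "__main__":
    print(play(Drawer("ride a horse"), Guesser()))  # hypothetical phrase
```

In the actual challenge, `revise` and `guess` would be learned models trained on the 55,000+ human games.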
Related papers
- IRFL: Image Recognition of Figurative Language [20.472997304393413]
Figurative forms are often conveyed through multiple modalities (e.g., both text and images).
We develop the Image Recognition of Figurative Language dataset.
We introduce two novel tasks as a benchmark for multimodal figurative language understanding.
arXiv Detail & Related papers (2023-03-27T17:59:55Z)
- Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images [63.629345688220496]
We introduce WHOOPS!, a new dataset and benchmark for visual commonsense.
The dataset comprises purposefully commonsense-defying images created by designers.
Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!
arXiv Detail & Related papers (2023-03-13T16:49:43Z)
- Infusing Commonsense World Models with Graph Knowledge [89.27044249858332]
We study the setting of generating narratives in an open-world text adventure game.
A graph representation of the underlying game state can be used to train models that both consume and produce grounded graph representations alongside natural language descriptions and actions.
arXiv Detail & Related papers (2023-01-13T19:58:27Z)
- WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models [91.92346150646007]
In this work, we introduce WinoGAViL: an online game to collect vision-and-language associations.
We use the game to collect 3.5K instances, finding that they are intuitive for humans but challenging for state-of-the-art AI models.
Our analysis, along with the feedback we collected from players, indicates that the collected associations require diverse reasoning skills.
arXiv Detail & Related papers (2022-07-25T23:57:44Z)
- Emergent Graphical Conventions in a Visual Communication Game [80.79297387339614]
Humans communicate with graphical sketches in addition to symbolic languages.
We take the very first step toward modeling and simulating how such graphical conventions evolve, via two neural agents playing a visual communication game.
We devise a novel reinforcement learning method in which agents jointly evolve toward successful communication and abstract graphical conventions.
arXiv Detail & Related papers (2021-11-28T18:59:57Z)
- IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning [132.49090098391258]
We introduce a new challenge of Icon Question Answering (IconQA) with the goal of answering a question in an icon image context.
We release IconQA, a large-scale dataset that consists of 107,439 questions and three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank.
We further release an icon dataset, Icon645, which contains 645,687 colored icons across 377 classes.
arXiv Detail & Related papers (2021-10-25T18:52:26Z)
- Emergent Communication of Generalizations [13.14792537601313]
We argue that communicating about a single object in a shared visual context is prone to overfitting and does not encourage language useful beyond concrete reference.
We propose games that require communicating generalizations over sets of objects representing abstract visual concepts.
We find that these games greatly improve systematicity and interpretability of the learned languages.
arXiv Detail & Related papers (2021-06-04T19:02:18Z)
- Enabling Robots to Draw and Tell: Towards Visually Grounded Multimodal Description Generation [1.52292571922932]
Socially competent robots should be equipped with the ability to perceive the world that surrounds them and communicate about it in a human-like manner.
Representative skills that exhibit such ability include generating image descriptions and visually grounded referring expressions.
We propose to model the task of generating natural language together with free-hand sketches/hand gestures to describe visual scenes and real-life objects.
arXiv Detail & Related papers (2021-01-14T23:40:23Z)