WinoGAViL: Gamified Association Benchmark to Challenge
Vision-and-Language Models
- URL: http://arxiv.org/abs/2207.12576v1
- Date: Mon, 25 Jul 2022 23:57:44 GMT
- Title: WinoGAViL: Gamified Association Benchmark to Challenge
Vision-and-Language Models
- Authors: Yonatan Bitton, Nitzan Bitton Guetta, Ron Yosef, Yuval Elovici, Mohit
Bansal, Gabriel Stanovsky, Roy Schwartz
- Abstract summary: In this work, we introduce WinoGAViL: an online game to collect vision-and-language associations.
We use the game to collect 3.5K instances, finding that they are intuitive for humans but challenging for state-of-the-art AI models.
Our analysis, as well as the feedback we collect from players, indicates that the collected associations require diverse reasoning skills.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While vision-and-language models perform well on tasks such as visual
question answering, they struggle when it comes to basic human commonsense
reasoning skills. In this work, we introduce WinoGAViL: an online game to
collect vision-and-language associations (e.g., werewolves to a full moon),
used as a dynamic benchmark to evaluate state-of-the-art models. Inspired by
the popular card game Codenames, a spymaster gives a textual cue related to
several visual candidates, and another player has to identify them. Human
players are rewarded for creating associations that are challenging for a rival
AI model but still solvable by other human players. We use the game to collect
3.5K instances, finding that they are intuitive for humans (>90% Jaccard index)
but challenging for state-of-the-art AI models, where the best model (ViLT)
achieves a score of 52%, succeeding mostly where the cue is visually salient.
Our analysis, as well as the feedback we collect from players, indicates that the
collected associations require diverse reasoning skills, including general
knowledge, common sense, abstraction, and more. We release the dataset, the
code and the interactive game, aiming to allow future data collection that can
be used to develop models with better association abilities.
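The evaluation described above compares the set of candidates a solver selects for a cue against the spymaster's gold associations using the Jaccard index (intersection over union of the two sets). Below is a minimal Python sketch of that scoring on a made-up instance; the dictionary fields and image names are hypothetical illustrations, not the released dataset's actual schema.

```python
# Minimal sketch of Jaccard-index scoring for a WinoGAViL-style instance.
# The instance structure and file names below are illustrative only.

def jaccard_index(predicted: set[str], gold: set[str]) -> float:
    """Intersection-over-union of two sets of image identifiers."""
    if not predicted and not gold:
        return 1.0  # both empty: treat as perfect agreement
    return len(predicted & gold) / len(predicted | gold)

# Hypothetical instance: a textual cue, the visual candidates, and the
# subset of candidates the spymaster associated with the cue.
instance = {
    "cue": "werewolf",
    "candidates": {"full_moon.jpg", "cat.jpg", "forest.jpg", "violin.jpg"},
    "gold": {"full_moon.jpg", "forest.jpg"},
}

# A solver (human or model) picks the candidates it associates with the cue.
model_selection = {"full_moon.jpg", "cat.jpg"}

score = jaccard_index(model_selection, instance["gold"])
print(f"Jaccard index: {score:.2f}")  # 1 shared / 3 in union -> 0.33
```

Aggregate numbers such as those quoted in the abstract would then correspond to this per-instance score averaged over the collected instances.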
Related papers
- Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
We evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players.
Our results show that even the best-performing LLM, Claude 3.5 Sonnet, can fully solve only 18% of the games.
We create a taxonomy of the knowledge types required to successfully cluster and categorize words in the Connections game.
arXiv Detail & Related papers (2024-06-16T17:10:32Z)
- PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns
We evaluate large multimodal models with abstract patterns based on fundamental concepts.
We find that they are not able to generalize well to simple abstract patterns.
Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities.
arXiv Detail & Related papers (2024-03-20T05:37:24Z)
- ArtWhisperer: A Dataset for Characterizing Human-AI Interactions in Artistic Creations
This work investigates how people use text-to-image models to generate desired target images.
We created ArtWhisperer, an online game where users are given a target image and are tasked with iteratively finding a prompt that produces an image similar to the target.
We recorded over 50,000 human-AI interactions; each interaction corresponds to one text prompt created by a user and the corresponding generated image.
arXiv Detail & Related papers (2023-06-13T21:10:45Z)
- Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion Models
We present a Promptable Game Model (PGM) for neural video game simulators.
It allows a user to play the game by prompting it with high- and low-level action sequences.
Most captivatingly, our PGM unlocks the director's mode, where the game is played by specifying goals for the agents in the form of a prompt.
Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state of the art.
arXiv Detail & Related papers (2023-03-23T17:43:17Z)
- Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images
We introduce WHOOPS!, a new dataset and benchmark for visual commonsense.
The dataset comprises purposefully commonsense-defying images created by designers.
Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!
arXiv Detail & Related papers (2023-03-13T16:49:43Z)
- CommonsenseQA 2.0: Exposing the Limits of AI through Gamification
We construct benchmarks that test the abilities of modern natural language understanding models.
In this work, we propose gamification as a framework for data construction.
arXiv Detail & Related papers (2022-01-14T06:49:15Z)
- Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text
Communicating with humans is challenging for AIs because it requires a shared understanding of the world, complex semantics, and at times multi-modal gestures.
We investigate these challenges in the context of Iconary, a collaborative game of drawing and guessing based on Pictionary.
We propose models to play Iconary and train them on over 55,000 games between human players.
arXiv Detail & Related papers (2021-12-01T19:41:03Z)
- AI in (and for) Games
This chapter outlines the relation between artificial intelligence (AI) / machine learning (ML) algorithms and digital games.
On one hand, AI/ML researchers can generate large, in-the-wild datasets of human affective activity and player behaviour.
On the other hand, games can utilise intelligent algorithms to automate testing of game levels, generate content, develop intelligent and responsive non-player characters (NPCs), or predict and respond to player behaviour.
arXiv Detail & Related papers (2021-05-07T08:57:07Z)