Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of
Synthetic and Compositional Images
- URL: http://arxiv.org/abs/2303.07274v4
- Date: Sat, 12 Aug 2023 22:37:31 GMT
- Title: Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of
Synthetic and Compositional Images
- Authors: Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt,
Yuval Elovici, Gabriel Stanovsky, Roy Schwartz
- Abstract summary: We introduce WHOOPS!, a new dataset and benchmark for visual commonsense.
The dataset is comprised of purposefully commonsense-defying images created by designers.
Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!
- Score: 63.629345688220496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weird, unusual, and uncanny images pique the curiosity of observers because
they challenge common sense. For example, an image released during the 2022
World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo
playing chess, which playfully violates our expectation that their competition
should occur on the football field. Humans can easily recognize and interpret
these unconventional images, but can AI models do the same? We introduce
WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset is
comprised of purposefully commonsense-defying images created by designers using
publicly-available image generation tools like Midjourney. We consider several
tasks posed over the dataset. In addition to image captioning, cross-modal
matching, and visual question answering, we introduce a difficult explanation
generation task, where models must identify and explain why a given image is
unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2
still lag behind human performance on WHOOPS!. We hope our dataset will inspire
the development of AI models with stronger visual commonsense reasoning
abilities. Data, models and code are available at the project website:
whoops-benchmark.github.io
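As a rough illustration of the explanation-generation task, the sketch below asks an off-the-shelf BLIP2 checkpoint what is unusual about a single image. The local file name, the prompt wording, and the checkpoint choice are assumptions made for illustration only; they are not the benchmark's official evaluation protocol (see the project website for that).

# Minimal sketch: elicit an "explain the weirdness" answer from a BLIP2 checkpoint.
# Assumes "whoops_example.jpg" is an image obtained from the project website;
# the prompt wording below is illustrative only.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

image = Image.open("whoops_example.jpg").convert("RGB")
prompt = "Question: What is unusual about this image? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

output_ids = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())

The benchmark itself scores such generated explanations against human-written ground truth; the snippet only shows how an explanation might be elicited, not how it is judged.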
Related papers
- Zero-Shot Detection of AI-Generated Images [54.01282123570917]
We propose a zero-shot entropy-based detector (ZED) to detect AI-generated images.
Inspired by recent works on machine-generated text detection, our idea is to measure how surprising the image under analysis is according to a model of real images.
ZED achieves an average improvement of more than 3% over the SoTA in terms of accuracy.
arXiv Detail & Related papers (2024-09-24T08:46:13Z)
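A toy sketch of the surprisal scoring described in the ZED entry above: the actual detector relies on a learned lossless coder for natural images, whereas here a crude neighbour-average predictor with a fixed Gaussian residual (the sigma value is an arbitrary assumption) stands in for that model of real images.

import numpy as np

def surprisal_score(gray_img: np.ndarray, sigma: float = 0.05) -> float:
    # Mean negative log-likelihood (nats/pixel) of a grayscale image in [0, 255]
    # under a stand-in "model of real images": each pixel is predicted by the
    # average of its upper and left neighbours, with Gaussian residuals.
    x = gray_img.astype(np.float64) / 255.0
    pred = 0.5 * (np.roll(x, 1, axis=0) + np.roll(x, 1, axis=1))
    resid = (x - pred)[1:, 1:]  # drop the wrap-around border
    nll = 0.5 * np.log(2.0 * np.pi * sigma ** 2) + resid ** 2 / (2.0 * sigma ** 2)
    return float(nll.mean())

# Images whose score falls outside the range measured on known real photos
# would be flagged as likely AI-generated.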
- Invisible Relevance Bias: Text-Image Retrieval Models Prefer AI-Generated Images [67.18010640829682]
We show that AI-generated images introduce an invisible relevance bias to text-image retrieval models.
The inclusion of AI-generated images in the training data of the retrieval models exacerbates the invisible relevance bias.
We propose an effective training method aimed at alleviating the invisible relevance bias.
arXiv Detail & Related papers (2023-11-23T16:22:58Z)
- Towards Pragmatic Semantic Image Synthesis for Urban Scenes [4.36080478413575]
We present a new task: given a dataset with synthetic images and labels and a dataset with unlabeled real images, our goal is to learn a model that can generate images with the content of the input mask and the appearance of real images.
We leverage the synthetic image as a guide to the content of the generated image by penalizing the difference between their high-level features on a patch level.
In contrast to previous works, which employ one discriminator that overfits the target-domain semantic distribution, we employ a discriminator for the whole image and multiscale discriminators on the image patches.
arXiv Detail & Related papers (2023-05-16T18:01:12Z)
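The patch-level feature penalty mentioned in the semantic image synthesis entry above can be illustrated with a generic patch-pooled feature-matching loss. The VGG-16 backbone, layer cutoff, pooling size, and L1 distance below are assumptions chosen for illustration, not the paper's actual design.

import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen feature extractor standing in for the "high-level features" above.
feature_net = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()
for p in feature_net.parameters():
    p.requires_grad_(False)

def patch_content_loss(generated: torch.Tensor, synthetic_guide: torch.Tensor,
                       patch_size: int = 8) -> torch.Tensor:
    # Both inputs are (B, 3, H, W) tensors in [0, 1].
    f_gen = feature_net(generated)
    f_syn = feature_net(synthetic_guide)
    # Pool features over local patches before comparing, so the penalty acts
    # at patch level rather than per pixel.
    return F.l1_loss(F.avg_pool2d(f_gen, patch_size),
                     F.avg_pool2d(f_syn, patch_size))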
- COFAR: Commonsense and Factual Reasoning in Image Search [2.6354148238224697]
One characteristic that makes humans superior to modern artificially intelligent models is the ability to interpret images beyond what is visually apparent.
We present a unified framework, namely Knowledge Retrieval-Augmented Multimodal Transformer (KRAMT), that treats the named visual entities in an image as a gateway to encyclopedic knowledge.
This unified framework is then used to perform image search requiring commonsense and factual reasoning.
arXiv Detail & Related papers (2022-10-16T14:43:13Z)
- WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models [91.92346150646007]
In this work, we introduce WinoGAViL: an online game to collect vision-and-language associations.
We use the game to collect 3.5K instances, finding that they are intuitive for humans but challenging for state-of-the-art AI models.
Our analysis as well as the feedback we collect from players indicate that the collected associations require diverse reasoning skills.
arXiv Detail & Related papers (2022-07-25T23:57:44Z)
- A Shared Representation for Photorealistic Driving Simulators [83.5985178314263]
We propose to improve the quality of generated images by rethinking the discriminator architecture.
The focus is on the class of problems where images are generated given semantic inputs, such as scene segmentation maps or human body poses.
We aim to learn a shared latent representation that encodes enough information to jointly perform semantic segmentation and content reconstruction, along with coarse-to-fine-grained adversarial reasoning.
arXiv Detail & Related papers (2021-12-09T18:59:21Z)
- Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text [70.14613727284741]
Communicating with humans is challenging for AIs because it requires a shared understanding of the world, complex semantics, and at times multi-modal gestures.
We investigate these challenges in the context of Iconary, a collaborative game of drawing and guessing based on Pictionary.
We propose models to play Iconary and train them on over 55,000 games between human players.
arXiv Detail & Related papers (2021-12-01T19:41:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.