Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks
from The New Yorker Caption Contest
- URL: http://arxiv.org/abs/2209.06293v2
- Date: Thu, 6 Jul 2023 06:20:00 GMT
- Title: Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks
from The New Yorker Caption Contest
- Authors: Jack Hessel and Ana Marasovi\'c and Jena D. Hwang and Lillian Lee and
Jeff Da and Rowan Zellers and Robert Mankoff and Yejin Choi
- Abstract summary: Large neural networks can now generate jokes, but do they really "understand" humor?
We challenge AI models with three tasks derived from the New Yorker Cartoon Caption Contest.
We find that both types of models struggle at all three tasks.
- Score: 70.40189243067857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large neural networks can now generate jokes, but do they really "understand"
humor? We challenge AI models with three tasks derived from the New Yorker
Cartoon Caption Contest: matching a joke to a cartoon, identifying a winning
caption, and explaining why a winning caption is funny. These tasks encapsulate
progressively more sophisticated aspects of "understanding" a cartoon; key
elements are the complex, often surprising relationships between images and
captions and the frequent inclusion of indirect and playful allusions to human
experience and culture. We investigate both multimodal and language-only
models: the former are challenged with the cartoon images directly, while the
latter are given multifaceted descriptions of the visual scene to simulate
human-level visual understanding. We find that both types of models struggle at
all three tasks. For example, our best multimodal models fall 30 accuracy
points behind human performance on the matching task, and, even when provided
ground-truth visual scene descriptors, human-authored explanations are
preferred head-to-head over the best machine-authored ones (few-shot GPT-4) in
more than 2/3 of cases. We release models, code, leaderboard, and corpus, which
includes newly-gathered annotations describing the image's locations/entities,
what's unusual in the scene, and an explanation of the joke.
Related papers
- Is AI fun? HumorDB: a curated dataset and benchmark to investigate graphical humor [8.75275650545552]
HumorDB is an image-only dataset specifically designed to advance visual humor understanding.
The dataset enables evaluation through binary classification, range regression, and pairwise comparison tasks.
HumorDB shows potential as a valuable benchmark for powerful large multimodal models.
arXiv Detail & Related papers (2024-06-19T13:51:40Z) - Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions [16.23585043442914]
This paper focuses on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction.
We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics.
Our results show that even state-of-the-art models still lag behind human performance on this task.
arXiv Detail & Related papers (2024-05-29T13:51:43Z) - Comics for Everyone: Generating Accessible Text Descriptions for Comic
Strips [0.0]
We create natural language descriptions of comic strips that are accessible to the visually impaired community.
Our method consists of two steps: first, we use computer vision techniques to extract information about the panels, characters, and text of the comic images.
We test our method on a collection of comics that have been annotated by human experts and measure its performance using both quantitative and qualitative metrics.
arXiv Detail & Related papers (2023-10-01T15:13:48Z) - Explore and Tell: Embodied Visual Captioning in 3D Environments [83.00553567094998]
In real-world scenarios, a single image may not offer a good viewpoint, hindering fine-grained scene understanding.
We propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities.
We propose a Cascade Embodied Captioning model (CaBOT), which comprises of a navigator and a captioner, to tackle this task.
arXiv Detail & Related papers (2023-08-21T03:46:04Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Image Retrieval from Contextual Descriptions [22.084939474881796]
Image Retrieval from Contextual Descriptions (ImageCoDe)
Models tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description.
Best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 in humans.
arXiv Detail & Related papers (2022-03-29T19:18:12Z) - Iconary: A Pictionary-Based Game for Testing Multimodal Communication
with Drawings and Text [70.14613727284741]
Communicating with humans is challenging for AIs because it requires a shared understanding of the world, complex semantics, and at times multi-modal gestures.
We investigate these challenges in the context of Iconary, a collaborative game of drawing and guessing based on Pictionary.
We propose models to play Iconary and train them on over 55,000 games between human players.
arXiv Detail & Related papers (2021-12-01T19:41:03Z) - Goal-driven text descriptions for images [7.059848512713061]
This thesis focuses on generating textual output given visual input.
We use a comprehension machine to guide the generated referring expressions to be more discriminative.
In Chapter 5, we study how training objectives and sampling methods affect the models' ability to generate diverse captions.
arXiv Detail & Related papers (2021-08-28T05:10:38Z) - MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z) - Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene
Text [93.08109196909763]
We propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN)
It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively.
It then introduces three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities.
arXiv Detail & Related papers (2020-03-31T05:56:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.