Related papers: Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest

URL: http://arxiv.org/abs/2209.06293v2
Date: Thu, 6 Jul 2023 06:20:00 GMT
Title: Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest
Authors: Jack Hessel and Ana Marasović and Jena D. Hwang and Lillian Lee and Jeff Da and Rowan Zellers and Robert Mankoff and Yejin Choi
Abstract summary: Large neural networks can now generate jokes, but do they really "understand" humor? We challenge AI models with three tasks derived from the New Yorker Cartoon Caption Contest. We find that both multimodal and language-only models struggle at all three tasks.
Score: 70.40189243067857
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large neural networks can now generate jokes, but do they really "understand" humor? We challenge AI models with three tasks derived from the New Yorker Cartoon Caption Contest: matching a joke to a cartoon, identifying a winning caption, and explaining why a winning caption is funny. These tasks encapsulate progressively more sophisticated aspects of "understanding" a cartoon; key elements are the complex, often surprising relationships between images and captions and the frequent inclusion of indirect and playful allusions to human experience and culture. We investigate both multimodal and language-only models: the former are challenged with the cartoon images directly, while the latter are given multifaceted descriptions of the visual scene to simulate human-level visual understanding. We find that both types of models struggle at all three tasks. For example, our best multimodal models fall 30 accuracy points behind human performance on the matching task, and, even when provided ground-truth visual scene descriptors, human-authored explanations are preferred head-to-head over the best machine-authored ones (few-shot GPT-4) in more than 2/3 of cases. We release models, code, leaderboard, and corpus, which includes newly-gathered annotations describing the image's locations/entities, what's unusual in the scene, and an explanation of the joke.

Related papers

Which LLMs Get the Joke? Probing Non-STEM Reasoning Abilities with HumorBench [16.929265302194782]
HumorBench is a benchmark designed to evaluate large language models' (LLMs) ability to reason about and explain sophisticated humor in cartoon captions. LLMs are evaluated on their explanations of the humor and on their ability to identify the joke elements.
arXiv Detail & Related papers (2025-07-29T03:44:43Z)
From Panels to Prose: Generating Literary Narratives from Comics [55.544015596503726]
We develop an automated system that generates text-based literary narratives from manga comics. Our approach aims to create evocative and immersive prose that not only conveys the original narrative but also captures the depth and complexity of characters.
arXiv Detail & Related papers (2025-03-30T07:18:10Z)
Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! [14.84123301554462]
We present UNPIE, a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities. Our dataset includes 1,000 puns, each accompanied by an image that explains both meanings. The results indicate that various Socratic Models and Visual-Language Models improve over the text-only models when given visual context.
arXiv Detail & Related papers (2024-10-01T19:32:57Z)
PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation [38.958695275774616]
We introduce a new transformer-based model, trained in a retrieval fashion, which can take as input any combination of the aforementioned modalities. We showcase the potential of such an embroidered pose representation for (1) SMPL regression from image with optional text cue; and (2) on the task of fine-grained instruction generation.
arXiv Detail & Related papers (2024-09-10T14:09:39Z)
HumorDB: Can AI understand graphical humor? [8.75275650545552]
This paper introduces HumorDB, a dataset designed to evaluate and advance visual humor understanding by AI systems. We evaluate humans, state-of-the-art vision models, and large vision-language models on three tasks: binary humor classification, funniness rating prediction, and pairwise humor comparison. The results reveal a gap between current AI systems and human-level humor understanding.
arXiv Detail & Related papers (2024-06-19T13:51:40Z)
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions [16.23585043442914]
This paper focuses on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics. Our results show that even state-of-the-art models still lag behind human performance on this task.
arXiv Detail & Related papers (2024-05-29T13:51:43Z)
Explore and Tell: Embodied Visual Captioning in 3D Environments [83.00553567094998]
In real-world scenarios, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. We propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities. We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task.
arXiv Detail & Related papers (2023-08-21T03:46:04Z)
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs. Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
Image Retrieval from Contextual Descriptions [22.084939474881796]
In Image Retrieval from Contextual Descriptions (ImageCoDe), models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description. The best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 for humans.
arXiv Detail & Related papers (2022-03-29T19:18:12Z)
Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text [70.14613727284741]
Communicating with humans is challenging for AIs because it requires a shared understanding of the world, complex semantics, and at times multi-modal gestures. We investigate these challenges in the context of Iconary, a collaborative game of drawing and guessing based on Pictionary. We propose models to play Iconary and train them on over 55,000 games between human players.
arXiv Detail & Related papers (2021-12-01T19:41:03Z)
Goal-driven text descriptions for images [7.059848512713061]
This thesis focuses on generating textual output given visual input. We use a comprehension machine to guide the generated referring expressions to be more discriminative. In Chapter 5, we study how training objectives and sampling methods affect the models' ability to generate diverse captions.
arXiv Detail & Related papers (2021-08-28T05:10:38Z)
MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech. MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets. On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text [93.08109196909763]
We propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN). It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively. It then introduces three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities.
arXiv Detail & Related papers (2020-03-31T05:56:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.