Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks
from The New Yorker Caption Contest
- URL: http://arxiv.org/abs/2209.06293v2
- Date: Thu, 6 Jul 2023 06:20:00 GMT
- Title: Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks
from The New Yorker Caption Contest
- Authors: Jack Hessel and Ana Marasović and Jena D. Hwang and Lillian Lee and
Jeff Da and Rowan Zellers and Robert Mankoff and Yejin Choi
- Abstract summary: Large neural networks can now generate jokes, but do they really "understand" humor?
We challenge AI models with three tasks derived from the New Yorker Cartoon Caption Contest.
We find that both types of models struggle at all three tasks.
- Score: 70.40189243067857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large neural networks can now generate jokes, but do they really "understand"
humor? We challenge AI models with three tasks derived from the New Yorker
Cartoon Caption Contest: matching a joke to a cartoon, identifying a winning
caption, and explaining why a winning caption is funny. These tasks encapsulate
progressively more sophisticated aspects of "understanding" a cartoon; key
elements are the complex, often surprising relationships between images and
captions and the frequent inclusion of indirect and playful allusions to human
experience and culture. We investigate both multimodal and language-only
models: the former are challenged with the cartoon images directly, while the
latter are given multifaceted descriptions of the visual scene to simulate
human-level visual understanding. We find that both types of models struggle at
all three tasks. For example, our best multimodal models fall 30 accuracy
points behind human performance on the matching task, and, even when provided
ground-truth visual scene descriptors, human-authored explanations are
preferred head-to-head over the best machine-authored ones (few-shot GPT-4) in
more than 2/3 of cases. We release models, code, leaderboard, and corpus, which
includes newly-gathered annotations describing the image's locations/entities,
what's unusual in the scene, and an explanation of the joke.
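To make the task setup concrete, below is a minimal sketch of how the caption-matching task could be framed as 5-way multiple choice: a model scores each candidate caption against a description of the cartoon, and accuracy is the fraction of cartoons for which the true caption scores highest. The `MatchingInstance` schema and the `score_caption` stub are illustrative assumptions, not the released corpus format or evaluation code.

```python
# Minimal sketch of a 5-way caption-matching evaluation in the spirit of the
# abstract. The instance schema and the scoring function are illustrative
# assumptions; the paper's released corpus and code may differ.
from dataclasses import dataclass
from typing import Callable, List
import random


@dataclass
class MatchingInstance:
    scene_description: str          # textual description of the cartoon (language-only setting)
    candidate_captions: List[str]   # five candidates, one of which belongs to this cartoon
    correct_index: int              # index of the true caption


def score_caption(scene: str, caption: str) -> float:
    """Placeholder score: higher means the caption fits the scene better.
    A real system would query a multimodal or language-only model here."""
    return random.random()


def matching_accuracy(instances: List[MatchingInstance],
                      scorer: Callable[[str, str], float]) -> float:
    """Pick the highest-scoring caption for each cartoon and report accuracy."""
    correct = 0
    for inst in instances:
        scores = [scorer(inst.scene_description, c) for c in inst.candidate_captions]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == inst.correct_index)
    return correct / len(instances) if instances else 0.0


if __name__ == "__main__":
    demo = [MatchingInstance(
        scene_description="An office meeting; one attendee is a large fish in a suit.",
        candidate_captions=["We'll circle back.", "Something smells fishy about Q3.",
                            "Take a memo.", "Nice weather.", "Who ordered lunch?"],
        correct_index=1)]
    print(f"matching accuracy: {matching_accuracy(demo, score_caption):.2f}")
```

Plugging an actual multimodal or language-only model into `score_caption` yields the evaluation loop described above; the released code and leaderboard define the authoritative protocol.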
Related papers
- Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! [14.84123301554462]
We present UNPIE, a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities.
Our dataset includes 1,000 puns, each accompanied by an image that explains both meanings.
The results indicate that various Socratic Models and Visual-Language Models improve over the text-only models when given visual context.
arXiv Detail & Related papers (2024-10-01T19:32:57Z)
- PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation [38.958695275774616]
We introduce a new transformer-based model, trained in a retrieval fashion, which can take as input any combination of the aforementioned modalities.
We showcase the potential of such an embroidered pose representation for (1) SMPL regression from image with optional text cue; and (2) on the task of fine-grained instruction generation.
arXiv Detail & Related papers (2024-09-10T14:09:39Z)
- Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions [16.23585043442914]
This paper focuses on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction.
We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics.
Our results show that even state-of-the-art models still lag behind human performance on this task.
arXiv Detail & Related papers (2024-05-29T13:51:43Z)
- Explore and Tell: Embodied Visual Captioning in 3D Environments [83.00553567094998]
In real-world scenarios, a single image may not offer a good viewpoint, hindering fine-grained scene understanding.
We propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities.
We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task.
arXiv Detail & Related papers (2023-08-21T03:46:04Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Image Retrieval from Contextual Descriptions [22.084939474881796]
The benchmark, Image Retrieval from Contextual Descriptions (ImageCoDe), tasks models with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description.
The best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 for humans.
arXiv Detail & Related papers (2022-03-29T19:18:12Z)
- Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text [70.14613727284741]
Communicating with humans is challenging for AIs because it requires a shared understanding of the world, complex semantics, and at times multi-modal gestures.
We investigate these challenges in the context of Iconary, a collaborative game of drawing and guessing based on Pictionary.
We propose models to play Iconary and train them on over 55,000 games between human players.
arXiv Detail & Related papers (2021-12-01T19:41:03Z)
- Goal-driven text descriptions for images [7.059848512713061]
This thesis focuses on generating textual output given visual input.
We use a comprehension machine to guide the generated referring expressions to be more discriminative.
In Chapter 5, we study how training objectives and sampling methods affect the models' ability to generate diverse captions.
arXiv Detail & Related papers (2021-08-28T05:10:38Z)
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
- Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text [93.08109196909763]
We propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN).
It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively.
It then introduces three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities.
arXiv Detail & Related papers (2020-03-31T05:56:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.