Related papers: From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy

From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy

URL: http://arxiv.org/abs/2504.09049v1
Date: Sat, 12 Apr 2025 02:19:53 GMT
Title: From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy
Authors: Adrianna Romanowski, Pedro H. V. Valois, Kazuhiro Fukui,
Abstract summary: In light of the widespread adoption of Large Language Models, the intersection of humor and AI has become no laughing matter.<n>In this study, we assess the ability of models in accurately identifying humorous quotes from a stand-up comedy transcript.<n>We propose a novel humor detection metric designed to evaluate LLMs amongst various prompts on their capability to extract humorous punchlines.
Score: 6.124881326867511
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Comedy serves as a profound reflection of the times we live in and is a staple element of human interactions. In light of the widespread adoption of Large Language Models (LLMs), the intersection of humor and AI has become no laughing matter. Advancements in the naturalness of human-computer interaction correlates with improvements in AI systems' abilities to understand humor. In this study, we assess the ability of models in accurately identifying humorous quotes from a stand-up comedy transcript. Stand-up comedy's unique comedic narratives make it an ideal dataset to improve the overall naturalness of comedic understanding. We propose a novel humor detection metric designed to evaluate LLMs amongst various prompts on their capability to extract humorous punchlines. The metric has a modular structure that offers three different scoring methods - fuzzy string matching, sentence embedding, and subspace similarity - to provide an overarching assessment of a model's performance. The model's results are compared against those of human evaluators on the same task. Our metric reveals that regardless of prompt engineering, leading models, ChatGPT, Claude, and DeepSeek, achieve scores of at most 51% in humor detection. Notably, this performance surpasses that of humans who achieve a score of 41%. The analysis of human evaluators and LLMs reveals variability in agreement, highlighting the subjectivity inherent in humor and the complexities involved in extracting humorous quotes from live performance transcripts. Code available at https://github.com/swaggirl9000/humor.

Related papers

If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs [55.8331366739144]
We introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in large language models (LLMs)<n>Our fact checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches.
arXiv Detail & Related papers (2025-03-30T16:50:57Z)
Can Pre-trained Language Models Understand Chinese Humor? [74.96509580592004]
This paper is the first work that systematically investigates the humor understanding ability of pre-trained language models (PLMs) We construct a comprehensive Chinese humor dataset, which can fully meet all the data requirements of the proposed evaluation framework. Our empirical study on the Chinese humor dataset yields some valuable observations, which are of great guiding value for future optimization of PLMs in humor understanding and generation.
arXiv Detail & Related papers (2024-07-04T18:13:38Z)
Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models [27.936545041302377]
Large language models (LLMs) can generate synthetic data for humor detection via editing texts. We benchmark LLMs on an existing human dataset and show that current LLMs display an impressive ability to 'unfun' jokes. We extend our approach to a code-mixed English-Hindi humor dataset, where we find that GPT-4's synthetic data is highly rated by bilingual annotators.
arXiv Detail & Related papers (2024-02-23T02:58:12Z)
Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like [49.2096391012794]
ELaTE is a zero-shot TTS that can generate natural laughing speech of any speaker based on a short audio prompt. We develop our model based on the foundation of conditional flow-matching-based zero-shot TTS. We show that ELaTE can generate laughing speech with significantly higher quality and controllability compared to conventional models.
arXiv Detail & Related papers (2024-02-12T02:58:10Z)
This joke is [MASK]: Recognizing Humor and Offense with Prompting [9.745213455946324]
Humor is a magnetic component in everyday human interactions and communications. We investigate the effectiveness of prompting, a new transfer learning paradigm for NLP, for humor recognition.
arXiv Detail & Related papers (2022-10-25T13:02:45Z)
Towards Multimodal Prediction of Spontaneous Humour: A Novel Dataset and First Results [84.37263300062597]
Humor is a substantial element of human social behavior, affect, and cognition. Current methods of humor detection have been exclusively based on staged data, making them inadequate for "real-world" applications. We contribute to addressing this deficiency by introducing the novel Passau-Spontaneous Football Coach Humor dataset, comprising about 11 hours of recordings.
arXiv Detail & Related papers (2022-09-28T17:36:47Z)
DeHumor: Visual Analytics for Decomposing Humor [36.300283476950796]
We develop DeHumor, a visual system for analyzing humorous behaviors in public speaking. To intuitively reveal the building blocks of each concrete example, DeHumor decomposes each humorous video into multimodal features. We show that DeHumor is able to highlight various building blocks of humor examples.
arXiv Detail & Related papers (2021-07-18T04:01:07Z)
Uncertainty and Surprisal Jointly Deliver the Punchline: Exploiting Incongruity-Based Features for Humor Recognition [0.6445605125467573]
We break down any joke into two distinct components: the set-up and the punchline. Inspired by the incongruity theory of humor, we model the set-up as the part developing semantic uncertainty. With increasingly powerful language models, we were able to feed the set-up along with the punchline into the GPT-2 language model.
arXiv Detail & Related papers (2020-12-22T13:48:09Z)
Predicting the Humorousness of Tweets Using Gaussian Process Preference Learning [56.18809963342249]
We present a probabilistic approach that learns to rank and rate the humorousness of short texts by exploiting human preference judgments and automatically sourced linguistic annotations. We report system performance for the campaign's two subtasks, humour detection and funniness score prediction, and discuss some issues arising from the conversion between the numeric scores used in the HAHA@IberLEF 2019 data and the pairwise judgment annotations required for our method.
arXiv Detail & Related papers (2020-08-03T13:05:42Z)
"The Boating Store Had Its Best Sail Ever": Pronunciation-attentive Contextualized Pun Recognition [80.59427655743092]
We propose Pronunciation-attentive Contextualized Pun Recognition (PCPR) to perceive human humor. PCPR derives contextualized representation for each word in a sentence by capturing the association between the surrounding context and its corresponding phonetic symbols. Results demonstrate that the proposed approach significantly outperforms the state-of-the-art methods in pun detection and location tasks.
arXiv Detail & Related papers (2020-04-29T20:12:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.