Oogiri-Master: Benchmarking Humor Understanding via Oogiri
- URL: http://arxiv.org/abs/2512.21494v1
- Date: Thu, 25 Dec 2025 03:59:20 GMT
- Title: Oogiri-Master: Benchmarking Humor Understanding via Oogiri
- Authors: Soichiro Murakami, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura
- Abstract summary: We study humor using the Japanese creative response game Oogiri, in which participants produce witty responses to a given prompt. Existing datasets contain few candidate responses per prompt, expose popularity signals during rating, and lack objective, comparable metrics for funniness. We introduce Oogiri-Master and Oogiri-Corpus, a benchmark and a dataset designed to enable rigorous evaluation of humor understanding in large language models.
- Score: 53.060893644603844
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Humor is a salient testbed for human-like creative thinking in large language models (LLMs). We study humor using the Japanese creative response game Oogiri, in which participants produce witty responses to a given prompt, and ask the following research question: what makes such responses funny to humans? Previous work has offered only limited reliable means to answer this question: existing datasets contain few candidate responses per prompt, expose popularity signals during rating, and lack objective, comparable metrics for funniness. We therefore introduce Oogiri-Master and Oogiri-Corpus, a benchmark and a dataset designed to enable rigorous evaluation of humor understanding in LLMs. Each prompt is paired with approximately 100 diverse candidate responses, and funniness is rated independently by approximately 100 human judges without access to others' ratings, reducing popularity bias and enabling robust aggregation. Using Oogiri-Corpus, we conduct a quantitative analysis of the linguistic factors associated with funniness, such as text length, ambiguity, and incongruity resolution, and derive objective metrics for predicting human judgments. We then benchmark a range of LLMs and human baselines on Oogiri-Master, demonstrating that state-of-the-art models approach human performance and that insight-augmented prompting further improves model performance. Our results provide a principled basis for evaluating and advancing humor understanding in LLMs.
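To make the rating protocol concrete, here is a minimal sketch (not the authors' code) of how roughly 100 independent judge ratings per response could be aggregated into a single robust funniness score; the trimming fraction and the 1-5 scale in the example are assumptions.

```python
# Illustrative sketch (not the authors' code): robust aggregation of
# independent funniness ratings, as described in the abstract. The
# trimming fraction and the 1-5 scale below are assumptions.
import statistics

def aggregate_funniness(ratings: list[float], trim: float = 0.1) -> float:
    """Aggregate ~100 independent judge ratings into one score.

    A trimmed mean discards the most extreme judgments on both ends,
    keeping the aggregate robust to outlier judges. Because each judge
    rated without seeing others' scores, no popularity signal leaks
    into the individual ratings being averaged.
    """
    k = int(len(ratings) * trim)            # ratings to drop per tail
    trimmed = sorted(ratings)[k:len(ratings) - k]
    return statistics.mean(trimmed)

# Example: five judges rate one candidate response on a 1-5 scale.
print(aggregate_funniness([2, 3, 3, 4, 5]))  # -> 3.4
```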
Related papers
- Who Laughs with Whom? Disentangling Influential Factors in Humor Preferences across User Clusters and LLMs [53.060893644603844]
Humor preferences vary widely across individuals and cultures, complicating the evaluation of humor using large language models (LLMs). In this study, we model heterogeneity in humor preferences in Oogiri, a Japanese creative response game, by clustering users from their voting logs and estimating cluster-specific weights over interpretable preference factors with Bradley-Terry-Luce models.
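As a rough illustration of this setup (not the paper's implementation), the sketch below fits cluster-specific Bradley-Terry-Luce weights by gradient ascent, where each response's strength is a linear function of interpretable features; the feature names and data shapes are assumptions.

```python
# A minimal sketch, assuming the setup described above: a Bradley-Terry-Luce
# model in which a response's "strength" is a linear function of
# interpretable features, with one weight vector per user cluster.
# Feature names and data shapes are assumptions, not the paper's code.
import numpy as np

def fit_btl_weights(X_win, X_lose, lr=0.1, steps=1000):
    """Fit weights w so that sigmoid(w @ (x_win - x_lose)) -> 1.

    X_win / X_lose: (n_votes, n_features) arrays holding the features of
    the winning / losing response for each pairwise vote in one cluster.
    """
    d = X_win - X_lose                        # feature difference per vote
    w = np.zeros(X_win.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(d @ w)))    # P(observed winner wins)
        w += lr * (d.T @ (1.0 - p)) / len(d)  # ascent on the log-likelihood
    return w

# Toy example: 3 votes, 2 features (e.g., length and an incongruity score).
X_win = np.array([[0.9, 0.7], [0.8, 0.6], [0.7, 0.9]])
X_lose = np.array([[0.2, 0.1], [0.3, 0.4], [0.1, 0.2]])
print(fit_btl_weights(X_win, X_lose))  # cluster-specific preference weights
```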
arXiv Detail & Related papers (2026-01-06T15:33:45Z)
- Assessing the Capabilities of LLMs in Humor: A Multi-dimensional Analysis of Oogiri Generation and Evaluation [11.402855509329711]
Computational humor is a frontier for creating advanced and engaging natural language processing (NLP) applications. Previous studies have benchmarked the humor capabilities of Large Language Models (LLMs). This paper argues that a multifaceted understanding of humor is necessary and addresses this gap by systematically evaluating LLMs through the lens of Oogiri.
arXiv Detail & Related papers (2025-11-12T09:16:58Z)
- From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy [6.124881326867511]
In light of the widespread adoption of Large Language Models, the intersection of humor and AI has become no laughing matter. In this study, we assess the ability of models to accurately identify humorous quotes from a stand-up comedy transcript. We propose a novel humor detection metric designed to evaluate LLMs, across a variety of prompts, on their capability to extract humorous punchlines.
arXiv Detail & Related papers (2025-04-12T02:19:53Z)
- Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z)
- Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition [38.822535662755314]
We propose a sample-efficient human evaluation method for large language models (LLMs). Our method automatically and adaptively selects a compact set of input instructions that maximize semantic discrepancy between pairs of LLM responses. Human evaluators then perform three-alternative forced choices on these paired responses, which are aggregated into a global ranking using Elo rating.
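The Elo aggregation step can be sketched as follows (a hedged illustration, not the paper's code); mapping the three-alternative forced choice to scores of 1, 0.5, and 0 is an assumption, and the K-factor and initial ratings are conventional defaults.

```python
# Hedged sketch of the aggregation step: Elo updates over pairwise human
# judgments. Mapping the three-alternative forced choice to scores of
# 1 (A preferred), 0 (B preferred), and 0.5 (tie) is an assumption, and
# the K-factor and starting rating are conventional defaults.
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return the updated Elo ratings after one comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Example: two LLMs start at 1000; model A's response is preferred once.
ra, rb = elo_update(1000.0, 1000.0, score_a=1.0)
print(ra, rb)  # A gains what B loses; repeating over all votes yields a ranking
```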
arXiv Detail & Related papers (2024-04-10T01:26:24Z)
- The Iron(ic) Melting Pot: Reviewing Human Evaluation in Humour, Irony and Sarcasm Generation [16.591822946975547]
We argue that the generation of more esoteric forms of language constitutes a subdomain where the characteristics of selected evaluator panels are of utmost importance.
We perform a critical survey of recent works in NLG to assess how well evaluation procedures are reported in this subdomain.
We note a severe lack of open reporting of evaluator demographic information, and a significant reliance on crowdsourcing platforms for recruitment.
arXiv Detail & Related papers (2023-11-09T17:50:23Z)
- SOUL: Towards Sentiment and Opinion Understanding of Language [96.74878032417054]
We propose a new task called Sentiment and Opinion Understanding of Language (SOUL).
SOUL aims to evaluate sentiment understanding through two subtasks: Review Comprehension (RC) and Justification Generation (JG).
arXiv Detail & Related papers (2023-10-27T06:48:48Z)
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
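A rough sketch of such a multi-agent debate loop appears below; the personas, round count, and the `call_llm` placeholder are illustrative assumptions, not ChatEval's actual interface.

```python
# A rough sketch in the spirit of the multi-agent debate described above.
# The personas, round count, and `call_llm` placeholder are illustrative
# assumptions, not ChatEval's actual interface.
def debate_evaluate(question, answer_a, answer_b, call_llm, rounds=2):
    personas = ["a strict critic", "a general reader", "a domain expert"]
    transcript = []
    for _ in range(rounds):
        for persona in personas:
            turn = call_llm(
                f"You are {persona}. Question: {question}\n"
                f"Answer A: {answer_a}\nAnswer B: {answer_b}\n"
                f"Debate so far: {transcript}\n"
                "Say which answer is better and why, in two sentences."
            )
            transcript.append((persona, turn))
    # Distill a final verdict from the full discussion.
    return call_llm(f"Given this debate: {transcript}\nOutput only 'A' or 'B'.")
```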
arXiv Detail & Related papers (2023-08-14T15:13:04Z)
- Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs). We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
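The rescaling recipe might look like the following sketch; `call_llm` is a placeholder for any chat-completion client, and the rubric text, prompt wording, and score parsing are assumptions rather than the paper's exact setup.

```python
# Sketch of the rescaling recipe described above. `call_llm` stands in for
# any chat-completion client; the rubric text, prompt wording, and score
# parsing are assumptions, not the paper's exact setup.
import re

RUBRIC = """Score the annotated example from 0 to 100:
0-33: the explanation reveals major quality problems
34-66: the explanation reveals minor quality problems
67-100: the explanation reveals no quality problems"""

def rescale_judgment(likert: int, explanation: str, call_llm) -> int:
    prompt = (
        f"{RUBRIC}\n\n"
        f"An annotator gave a Likert rating of {likert}/5 and explained:\n"
        f"\"{explanation}\"\n\n"
        "Reply with a single integer score anchored in the rubric."
    )
    reply = call_llm(prompt)              # one chat-completion call
    match = re.search(r"\d+", reply)      # pull out the numeric score
    return int(match.group()) if match else likert * 20  # crude linear fallback
```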
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
- Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential for human-like psychology remains largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)