Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models
- URL: http://arxiv.org/abs/2510.18454v1
- Date: Tue, 21 Oct 2025 09:28:09 GMT
- Title: Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models
- Authors: Atharvan Dogra, Soumya Suvra Ghosal, Ameet Deshpande, Ashwin Kalyan, Dinesh Manocha
- Abstract summary: Large language models are increasingly used for creative writing and engagement content, raising safety concerns about the outputs. This work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by measuring humor, stereotypicality, and toxicity.
- Score: 55.98686105081078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models are increasingly used for creative writing and engagement content, raising safety concerns about their outputs. Casting humor generation as a testbed, this work therefore evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity. This is further supplemented by an analysis of incongruity signals through information-theoretic metrics. Across six models, we observe that harmful outputs receive higher humor scores, which increase further under role-based prompting, indicating a bias amplification loop between generators and evaluators. Information-theoretic analyses show that harmful cues widen predictive uncertainty and, surprisingly, can even make harmful punchlines more expected for some models, suggesting structural embedding in learned humor distributions. External validation on an additional satire-generation task with human-perceived funniness judgments shows that LLM satire increases stereotypicality and, typically, toxicity, including for closed models. Quantitatively, stereotypical/toxic jokes gain $10-21\%$ in mean humor score; stereotypical jokes appear $11\%$ to $28\%$ more often among jokes marked funny by an LLM-based metric, and up to $10\%$ more often in generations perceived as funny by humans.
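The incongruity signals the abstract mentions reduce to two standard information-theoretic quantities: the surprisal of punchline tokens given the set-up, and the predictive entropy of the model's next-token distribution. Below is a minimal sketch of how such signals can be computed; the model choice (gpt2), the nats-based units, and the helper `punchline_metrics` are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch: mean surprisal and mean predictive entropy (in nats)
# of punchline tokens under a causal LM. gpt2 and the setup/punchline
# split are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def punchline_metrics(setup: str, punchline: str):
    """Return (mean surprisal, mean entropy) of punchline tokens given the setup."""
    setup_ids = tokenizer(setup, return_tensors="pt").input_ids
    full_ids = tokenizer(setup + " " + punchline, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # (1, T, vocab)
    # The distribution at position t predicts token t+1.
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    # Assumes the setup tokens form a prefix of the full tokenization,
    # which holds for clean word boundaries under the GPT-2 BPE.
    start = setup_ids.shape[1] - 1
    pl_log_probs, pl_targets = log_probs[start:], targets[start:]
    surprisal = -pl_log_probs.gather(1, pl_targets.unsqueeze(1)).mean().item()
    entropy = -(pl_log_probs.exp() * pl_log_probs).sum(dim=-1).mean().item()
    return surprisal, entropy

s, h = punchline_metrics("Why did the chicken cross the road?",
                         "To get to the other side.")
print(f"punchline surprisal={s:.2f} nats, mean entropy={h:.2f} nats")
```

Under this reading, lower punchline surprisal means a more expected punchline, which is how the abstract's finding that harmful punchlines can become more expected would register in such a metric.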
Related papers
- Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective [104.09817371557476]
Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks. Their potential to generate harmful content has raised serious safety concerns. We introduce three novel multi-label benchmarks for toxicity detection.
arXiv Detail & Related papers (2025-10-16T06:50:33Z)
- From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy [6.124881326867511]
In light of the widespread adoption of Large Language Models, the intersection of humor and AI has become no laughing matter. In this study, we assess the ability of models to accurately identify humorous quotes from a stand-up comedy transcript. We propose a novel humor detection metric designed to evaluate LLMs, across various prompts, on their capability to extract humorous punchlines.
arXiv Detail & Related papers (2025-04-12T02:19:53Z)
- Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content [0.0]
We present the Deceptive Humor Dataset (DHD), a collection of humor-infused comments derived from fabricated claims. Each entry is labeled with a Satire Level (from 1 for subtle satire to 3 for overt satire) and categorized into five humor types. The dataset spans English, Telugu, Hindi, Kannada, Tamil, and their code-mixed forms, making it a valuable resource for multilingual analysis.
arXiv Detail & Related papers (2025-03-20T10:58:02Z)
- Aligned Probing: Relating Toxic Behavior and Model Internals [78.20380492883022]
We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs) with their internal representations. Using this framework, we examine over 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives on toxicity for the first time. Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers.
arXiv Detail & Related papers (2025-03-17T17:23:50Z)
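As a toy illustration of the probing idea in this entry, the sketch below fits a linear probe on one layer's hidden states to predict input toxicity; the model (gpt2), the layer index, the mean pooling, and the four hand-labeled examples are all assumptions for illustration, not the paper's OLMo/Llama/Mistral setup.

```python
# Minimal probing sketch: a logistic-regression probe on one layer's
# hidden states predicting input toxicity. All choices here (gpt2,
# layer 3, mean pooling, toy labels) are illustrative assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def layer_features(texts, layer=3):
    """Mean-pooled hidden states from one layer, one vector per text."""
    feats = []
    for t in texts:
        ids = tokenizer(t, return_tensors="pt")
        with torch.no_grad():
            hs = model(**ids).hidden_states[layer]  # (1, T, d)
        feats.append(hs.mean(dim=1).squeeze(0).numpy())
    return feats

# Toy supervision; real work would use a labeled toxicity corpus.
texts = ["You are wonderful.", "You are an idiot.",
         "Have a nice day.", "I hate you."]
labels = [0, 1, 0, 1]
probe = LogisticRegression(max_iter=1000).fit(layer_features(texts), labels)
print(probe.predict(layer_features(["You people disgust me."])))
```

Repeating the fit across layers and comparing probe accuracy is the usual way to locate where such information concentrates, e.g. the lower layers highlighted above.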
- CleanComedy: Creating Friendly Humor through Generative Techniques [5.720553544629197]
This paper proposes CleanComedy, a specialized, partially annotated, toxicity-filtered corpus of English and Russian jokes. We study the effectiveness of our data filtering approach through a survey on humor and toxicity levels in various joke groups. In addition, we study advances in computer humor generation by comparing jokes written by humans with various groups of generated jokes, including those from our baseline models trained on the CleanComedy datasets.
arXiv Detail & Related papers (2024-12-12T11:57:59Z)
- Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models [27.936545041302377]
Large language models (LLMs) can generate synthetic data for humor detection via editing texts.
We benchmark LLMs on an existing human dataset and show that current LLMs display an impressive ability to 'unfun' jokes.
We extend our approach to a code-mixed English-Hindi humor dataset, where we find that GPT-4's synthetic data is highly rated by bilingual annotators.
arXiv Detail & Related papers (2024-02-23T02:58:12Z)
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting.
We propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- Towards Multimodal Prediction of Spontaneous Humour: A Novel Dataset and First Results [84.37263300062597]
Humor is a substantial element of human social behavior, affect, and cognition.
Current methods of humor detection have been exclusively based on staged data, making them inadequate for "real-world" applications.
We contribute to addressing this deficiency by introducing the novel Passau-Spontaneous Football Coach Humor dataset, comprising about 11 hours of recordings.
arXiv Detail & Related papers (2022-09-28T17:36:47Z)
- Uncertainty and Surprisal Jointly Deliver the Punchline: Exploiting Incongruity-Based Features for Humor Recognition [0.6445605125467573]
We break down any joke into two distinct components: the set-up and the punchline.
Inspired by the incongruity theory of humor, we model the set-up as the part developing semantic uncertainty.
Leveraging increasingly powerful language models, we feed the set-up together with the punchline into the GPT-2 language model to quantify these incongruity signals; the definitions are sketched below.
arXiv Detail & Related papers (2020-12-22T13:48:09Z)
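On a hedged reading of this entry (the exact formulation is the paper's), the set-up's semantic uncertainty can be taken as the entropy of the next-token distribution after the set-up, and the punchline's surprisal as its average negative log-probability given the set-up:

$$
U(\text{set-up}) = -\sum_{w \in V} p(w \mid \text{set-up}) \log p(w \mid \text{set-up}), \qquad S(\text{punchline}) = -\frac{1}{m} \sum_{j=1}^{m} \log p(t_j \mid \text{set-up}, t_{<j}),
$$

where $t_1, \dots, t_m$ are the punchline tokens and $V$ is the vocabulary. The Python sketch after the main abstract computes exactly these two quantities under GPT-2.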
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
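The degeneration protocol the RealToxicityPrompts entry describes, prompting an LM and scoring its continuations for toxicity, can be sketched as follows; the gpt2 generator, the open-source Detoxify scorer (standing in for the Perspective API used in the original paper), and the small sample size are illustrative assumptions.

```python
# Minimal sketch of RealToxicityPrompts-style evaluation: sample k
# continuations per prompt and report the expected maximum toxicity.
# gpt2 and Detoxify are illustrative stand-ins, not the paper's setup.
from detoxify import Detoxify
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
scorer = Detoxify("original")

def expected_max_toxicity(prompts, k=5):
    """Mean over prompts of the max toxicity among k sampled continuations."""
    max_scores = []
    for prompt in prompts:
        outs = generator(prompt, max_new_tokens=20, num_return_sequences=k,
                         do_sample=True, pad_token_id=50256)
        continuations = [o["generated_text"][len(prompt):] for o in outs]
        tox = scorer.predict(continuations)["toxicity"]  # list of probabilities
        max_scores.append(max(tox))
    return sum(max_scores) / len(max_scores)

print(expected_max_toxicity(["So, I'm starting to think they're full of"], k=5))
```

The paper computes this statistic over many more generations per prompt and a large prompt set; a small k simply keeps the sketch cheap.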