HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation
- URL: http://arxiv.org/abs/2511.21732v1
- Date: Fri, 21 Nov 2025 09:52:46 GMT
- Title: HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation
- Authors: Jiajun Zhang, Shijia Luo, Ruikang Zhang, Qi Su
- Abstract summary: Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation. We propose HUMORCHAIN, a theory-guided multi-stage reasoning framework. It integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation.
- Score: 13.49193658655368
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation. Although producing humor requires complex cognitive reasoning and social understanding, theories of humor suggest that it follows learnable patterns and structures, making it theoretically possible for generative models to acquire them implicitly. In recent years, multimodal humor has become a prevalent form of online communication, especially among Gen Z, highlighting the need for AI systems capable of integrating visual understanding with humorous language generation. However, existing data-driven approaches lack explicit modeling or theoretical grounding of humor and often produce literal descriptions that fail to capture its underlying cognitive mechanisms, resulting in generated image descriptions that are fluent but lack genuine humor or cognitive depth. To address this limitation, we propose HUMORCHAIN (HUmor-guided Multi-step Orchestrated Reasoning Chain for Image Captioning), a theory-guided multi-stage reasoning framework. It integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation, forming an interpretable and controllable cognitive reasoning chain. To the best of our knowledge, this is the first work to explicitly embed cognitive structures from humor theories into multimodal humor generation, enabling a structured reasoning process from visual understanding to humor creation. Experiments on the Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets show that HUMORCHAIN outperforms state-of-the-art baselines in human humor preference, Elo/BT scores, and semantic diversity, demonstrating that theory-driven structured reasoning enables large language models to generate humor aligned with human perception.
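The abstract describes a three-stage chain: visual semantic parsing, theory-guided humor reasoning, and a fine-tuned discriminator that gates the output. A minimal sketch of that control flow is below; all function names, the stub logic, and the scoring rule are illustrative assumptions, not the authors' implementation.

```python
from typing import Optional

def parse_visual_semantics(image_desc: str) -> dict:
    """Stage 1 (hypothetical stub): extract entities and scene context.
    A real system would run a vision model; here we just tokenize."""
    return {"entities": image_desc.split(), "scene": image_desc}

def humor_reasoning(semantics: dict) -> str:
    """Stage 2 (hypothetical stub): apply a theory-guided move, e.g.
    incongruity -- pair the literal scene with an unexpected frame."""
    subject = semantics["entities"][0] if semantics["entities"] else "it"
    return f"When the {subject} files its quarterly report"

def humor_discriminator(caption: str) -> float:
    """Stage 3 (hypothetical stub): stand-in for a fine-tuned humor
    scorer returning a value in [0, 1]."""
    return 0.9 if "quarterly" in caption else 0.1

def humorchain(image_desc: str, threshold: float = 0.5) -> Optional[str]:
    """Chain the three stages; emit a caption only if it passes the gate."""
    semantics = parse_visual_semantics(image_desc)
    caption = humor_reasoning(semantics)
    score = humor_discriminator(caption)
    return caption if score >= threshold else None

caption = humorchain("cat sitting on a laptop")
```

The point of the sketch is the interpretable pipeline shape: each stage's intermediate output (semantics, draft caption, score) is inspectable, which is what makes the reasoning chain controllable.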
Related papers
- On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation [10.157232656580659]
Humor is a commonly used and intricate human language in daily life. We develop a novel humor generation mechanism based on a fundamental humor theory, GTVH. To produce funny and script-opposite captions, we introduce a humor-theory-driven multi-role LLM collaboration framework.
arXiv Detail & Related papers (2026-02-06T06:41:33Z) - Toward Cognitive Supersensing in Multimodal Large Language Model [67.15559571626747]
We introduce Cognitive Supersensing, a training paradigm that endows MLLMs with human-like visual imagery capabilities. In experiments, MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench. We will open-source the CogSense-Bench and our model weights.
arXiv Detail & Related papers (2026-02-02T02:19:50Z) - Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs [85.69785384599827]
Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set. We propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task to an open-vocabulary generation problem.
arXiv Detail & Related papers (2025-12-19T14:41:50Z) - Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs [53.57402214935238]
Sketch-in-Latents is a novel paradigm for unified multi-modal reasoning. It generates continuous visual embeddings, termed latent sketch tokens, as visual thoughts. It achieves superior performance on vision-centric tasks while exhibiting strong generalization to diverse general multi-modal benchmarks.
arXiv Detail & Related papers (2025-12-18T14:29:41Z) - Innovative Thinking, Infinite Humor: Humor Research of Large Language Models through Structured Thought Leaps [34.35304020094762]
Humor is a nuanced aspect of human language, presenting challenges for its understanding and generation. Due to the sparsity of the knowledge graph in creative thinking, it is arduous to achieve multi-hop reasoning. We propose a more robust framework for addressing the humor reasoning task, named LoL.
arXiv Detail & Related papers (2024-10-14T10:50:16Z) - THInC: A Theory-Driven Framework for Computational Humor Detection [2.0960189135529212]
There is still no agreement on a single, comprehensive humor theory.
Most computational approaches to detecting humor are not based on existing humor theories.
This paper contributes to bridging this long-standing gap by creating an interpretable framework for humor classification.
arXiv Detail & Related papers (2024-09-02T13:09:26Z) - HumorDB: Can AI understand graphical humor? [10.207371106800187]
This paper introduces HumorDB, a dataset designed to evaluate and advance visual humor understanding by AI systems. We evaluate humans, state-of-the-art vision models, and large vision-language models on three tasks: binary humor classification, funniness rating prediction, and pairwise humor comparison. The results reveal a gap between current AI systems and human-level humor understanding.
arXiv Detail & Related papers (2024-06-19T13:51:40Z) - From Word Models to World Models: Translating from Natural Language to
the Probabilistic Language of Thought [124.40905824051079]
We propose rational meaning construction, a computational framework for language-informed thinking.
We frame linguistic meaning as a context-sensitive mapping from natural language into a probabilistic language of thought.
We show that LLMs can generate context-sensitive translations that capture pragmatically-appropriate linguistic meanings.
We extend our framework to integrate cognitively-motivated symbolic modules.
arXiv Detail & Related papers (2023-06-22T05:14:00Z) - Towards Multimodal Prediction of Spontaneous Humour: A Novel Dataset and First Results [84.37263300062597]
Humor is a substantial element of human social behavior, affect, and cognition.
Current methods of humor detection have been exclusively based on staged data, making them inadequate for "real-world" applications.
We contribute to addressing this deficiency by introducing the novel Passau-Spontaneous Football Coach Humor dataset, comprising about 11 hours of recordings.
arXiv Detail & Related papers (2022-09-28T17:36:47Z) - Advancing Humor-Focused Sentiment Analysis through Improved Contextualized Embeddings and Model Architecture [0.0]
Humor allows us to express thoughts and feelings conveniently and effectively.
As language models become ubiquitous through virtual assistants and IoT devices, the need for humor-aware models grows rapidly.
arXiv Detail & Related papers (2020-11-23T22:30:32Z) - Machine Common Sense [77.34726150561087]
Machine common sense remains a broad, potentially unbounded problem in artificial intelligence (AI).
This article addresses aspects of modeling commonsense reasoning, focusing on domains such as interpersonal interactions.
arXiv Detail & Related papers (2020-06-15T13:59:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.