On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation
- URL: http://arxiv.org/abs/2602.06423v1
- Date: Fri, 06 Feb 2026 06:41:33 GMT
- Title: On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation
- Authors: Wenbo Shang, Yuxi Sun, Jing Ma, Xin Huang
- Abstract summary: Humor is a commonly used and intricate human language in daily life. We develop a novel humor generation mechanism based on a fundamental humor theory, GTVH. To produce funny and script-opposite captions, we introduce a humor-theory-driven multi-role LLM collaboration framework.
- Score: 10.157232656580659
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humor is a commonly used and intricate form of human language in daily life. Humor generation, especially in multi-modal scenarios, is a challenging task for large language models (LLMs); it is typically framed as funny caption generation for images, requiring visual understanding, humor reasoning, creative imagination, and more. Existing LLM-based approaches rely on reasoning chains or self-improvement, which suffer from limited creativity and interpretability. To address these bottlenecks, we develop a novel LLM-based humor generation mechanism based on a fundamental humor theory, GTVH. To produce funny and script-opposite captions, we introduce a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval (HOMER). The framework consists of three LLM-based roles: (1) a conflicting-script extractor that grounds humor in key script oppositions, forming the basis of caption generation; (2) a retrieval-augmented hierarchical imaginator that identifies key humor targets and expands their creative space through diverse associations structured as imagination trees; and (3) a caption generator that produces funny and diverse captions conditioned on the obtained knowledge. Extensive experiments on two New Yorker Cartoon benchmark datasets show that HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multi-modal humor captioning.
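The three-role collaboration described in the abstract can be sketched as a simple sequential pipeline. This is a minimal illustration, not the authors' implementation: the role prompts, the `call_llm` stub, and all function names are assumptions made for clarity.

```python
# Hypothetical sketch of HOMER's three-role pipeline: conflicting-script
# extraction -> retrieval-augmented hierarchical imagination -> caption
# generation. `call_llm` is a placeholder for a real LLM API call.

def call_llm(prompt: str) -> str:
    # Stand-in for an actual LLM request; returns a tagged echo here.
    return f"[LLM output for: {prompt[:50]}]"

def extract_conflicting_scripts(image_description: str) -> str:
    # Role 1: ground the humor in key script oppositions.
    return call_llm(f"Identify two opposing scripts in: {image_description}")

def imagine_hierarchically(scripts: str, retrieved_examples: list[str]) -> str:
    # Role 2: expand humor targets via diverse associations
    # (an "imagination tree"), augmented with retrieved humorous captions.
    context = "\n".join(retrieved_examples)
    return call_llm(
        f"Given the oppositions ({scripts}) and example captions:\n"
        f"{context}\nBuild an imagination tree of associations."
    )

def generate_caption(scripts: str, imagination: str) -> str:
    # Role 3: produce a funny, script-opposite caption from the
    # accumulated knowledge.
    return call_llm(f"Write a caption using {scripts} and {imagination}")

def homer_pipeline(image_description: str, retrieved_examples: list[str]) -> str:
    scripts = extract_conflicting_scripts(image_description)
    imagination = imagine_hierarchically(scripts, retrieved_examples)
    return generate_caption(scripts, imagination)
```

Each stage conditions on the previous one's output, which is what lets the final caption stay grounded in the extracted script opposition rather than drifting into generic humor.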
Related papers
- Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs [53.57402214935238]
Sketch-in-Latents is a novel paradigm for unified multi-modal reasoning. It generates continuous visual embeddings, termed latent sketch tokens, as visual thoughts. It achieves superior performance on vision-centric tasks while exhibiting strong generalization to diverse general multi-modal benchmarks.
arXiv Detail & Related papers (2025-12-18T14:29:41Z) - HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation [13.49193658655368]
Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation. We propose HUMORCHAIN, a theory-guided multi-stage reasoning framework. It integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation.
arXiv Detail & Related papers (2025-11-21T09:52:46Z) - V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs [72.59885036868499]
v-HUB is a visual-centric video humor understanding benchmark. Each video clip is paired with rich annotations, including captions, descriptions, and explanations. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process audio.
arXiv Detail & Related papers (2025-09-30T04:33:52Z) - Which LLMs Get the Joke? Probing Non-STEM Reasoning Abilities with HumorBench [16.929265302194782]
HumorBench is a benchmark designed to evaluate large language models' (LLMs) ability to reason about and explain sophisticated humor in cartoon captions. LLMs are evaluated on their explanations of the humor and their ability to identify the joke elements.
arXiv Detail & Related papers (2025-07-29T03:44:43Z) - From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy [6.124881326867511]
In light of the widespread adoption of Large Language Models, the intersection of humor and AI has become no laughing matter. In this study, we assess the ability of models to accurately identify humorous quotes from a stand-up comedy transcript. We propose a novel humor detection metric designed to evaluate LLMs, across various prompts, on their capability to extract humorous punchlines.
arXiv Detail & Related papers (2025-04-12T02:19:53Z) - A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models [100.16387798660833]
The Oogiri game is a creativity-driven task requiring humor and associative thinking. LoTbench is an interactive, causality-aware evaluation framework. Results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable.
arXiv Detail & Related papers (2025-01-25T09:11:15Z) - Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor [0.0]
Humor is frequently multimodal, relying on phonetic ambiguity, rhythm, and timing to convey meaning. We present an LLM with both the text and the spoken form of a joke, generated using an off-the-shelf text-to-speech (TTS) system.
arXiv Detail & Related papers (2024-12-01T06:49:31Z) - LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [72.02635550088546]
This work explores how large language models (LLMs) can enhance CLIP's capability, especially for processing longer and more complex image captions. We introduce a caption-to-caption contrastive fine-tuning framework, significantly enhancing the discriminative quality of LLM outputs. Our approach outperforms LoRA-based methods, achieving nearly fourfold faster training with superior performance.
arXiv Detail & Related papers (2024-11-07T18:59:16Z) - Innovative Thinking, Infinite Humor: Humor Research of Large Language Models through Structured Thought Leaps [34.35304020094762]
Humor is a nuanced aspect of human language, presenting challenges for its understanding and generation. Due to the sparsity of the knowledge graph involved in creative thinking, multi-hop reasoning is arduous to achieve. We propose a more robust framework for addressing the humor reasoning task, named LoL.
arXiv Detail & Related papers (2024-10-14T10:50:16Z) - Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs [77.86214400258473]
We propose a new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG)
RPG harnesses the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models.
Our framework exhibits wide compatibility with various MLLM architectures.
arXiv Detail & Related papers (2024-01-22T06:16:29Z) - SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen
LLMs [124.29233620842462]
We introduce SPAE for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos.
The resulting lexical tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction.
Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
arXiv Detail & Related papers (2023-06-30T17:59:07Z) - DeHumor: Visual Analytics for Decomposing Humor [36.300283476950796]
We develop DeHumor, a visual system for analyzing humorous behaviors in public speaking.
To intuitively reveal the building blocks of each concrete example, DeHumor decomposes each humorous video into multimodal features.
We show that DeHumor is able to highlight various building blocks of humor examples.
arXiv Detail & Related papers (2021-07-18T04:01:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.