Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor
- URL: http://arxiv.org/abs/2412.05315v1
- Date: Sun, 01 Dec 2024 06:49:31 GMT
- Title: Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor
- Authors: Ashwin Baluja
- Abstract summary: Humor is frequently multimodal, relying on phonetic ambiguity, rhythm, and timing to convey meaning. We present an LLM with both the text and the spoken form of a joke, generated using an off-the-shelf text-to-speech (TTS) system.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Large Language Models (LLMs) have demonstrated impressive natural language understanding capabilities across various text-based tasks, understanding humor has remained a persistent challenge. Humor is frequently multimodal, relying on phonetic ambiguity, rhythm, and timing to convey meaning. In this study, we explore a simple multimodal prompting approach to humor understanding and explanation. We present an LLM with both the text and the spoken form of a joke, generated using an off-the-shelf text-to-speech (TTS) system. Across all tested datasets, these multimodal cues yield better humor explanations than text-only prompts.
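The prompting setup is simple enough to sketch end to end. The snippet below is a minimal illustration, not the authors' exact pipeline: it assumes the OpenAI Python client, with `tts-1` standing in for the off-the-shelf TTS system and `gpt-4o-audio-preview` standing in for an audio-capable LLM; the model choices and prompt wording are assumptions.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JOKE = ("Why did the scarecrow win an award? "
        "Because he was outstanding in his field.")

# 1) Render the joke with an off-the-shelf TTS system
#    (model and voice here are illustrative choices).
speech = client.audio.speech.create(
    model="tts-1", voice="alloy", input=JOKE, response_format="wav"
)
audio_b64 = base64.b64encode(speech.read()).decode("utf-8")

# 2) Prompt an audio-capable LLM with BOTH the text and the spoken form.
response = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # assumed audio-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Here is a joke, in text and audio. "
                     f"Explain why it is funny.\n\nText: {JOKE}"},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Any TTS engine and any LLM that accepts interleaved text and audio could be substituted; the point is only that the joke reaches the model in both modalities.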
Related papers
- On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation [10.157232656580659]
Humor is a common and intricate feature of human language in daily life.
We develop a novel humor generation mechanism based on a fundamental humor theory, the General Theory of Verbal Humor (GTVH).
To produce funny, script-opposing captions, we introduce a humor-theory-driven multi-role LLM collaboration framework (a sketch follows below).
arXiv Detail & Related papers (2026-02-06T06:41:33Z)
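The abstract does not spell out the roles, so the following is only a plausible, heavily simplified reading of a GTVH-driven multi-role collaboration: one hypothetical role extracts the scene's expected script, a second proposes an opposing script, and a third writes the caption. All prompts and the `query_llm` helper are invented for illustration.

```python
from typing import Callable

def multi_role_caption(scene: str, query_llm: Callable[[str], str]) -> str:
    """Hypothetical GTVH-style pipeline: analyst -> opposer -> writer."""
    expected = query_llm(
        f"Describe the ordinary, expected 'script' of this scene: {scene}"
    )
    opposing = query_llm(
        f"Propose a conflicting script that overturns this expectation:\n{expected}"
    )
    return query_llm(
        "Write a short, funny caption for the scene that pivots from the "
        f"expected script to the conflicting one.\nScene: {scene}\n"
        f"Expected: {expected}\nConflicting: {opposing}\nCaption:"
    ).strip()
```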
- V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs [72.59885036868499]
v-HUB is a visual-centric video humor understanding benchmark.
Each video clip is paired with rich annotations, including captions, descriptions, and explanations.
We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process audio.
arXiv Detail & Related papers (2025-09-30T04:33:52Z)
- Beyond Words: Multimodal LLM Knows When to Speak [25.374878759869333]
We focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text.
We introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams.
We propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur, and what type of response is appropriate.
arXiv Detail & Related papers (2025-05-20T17:42:34Z)
- Can Pre-trained Language Models Understand Chinese Humor? [74.96509580592004]
This paper is the first to systematically investigate the humor understanding ability of pre-trained language models (PLMs).
We construct a comprehensive Chinese humor dataset that fully meets the data requirements of the proposed evaluation framework.
Our empirical study on this dataset yields several observations that can guide future work on PLMs for humor understanding and generation.
arXiv Detail & Related papers (2024-07-04T18:13:38Z)
- Decomposed Prompting: Probing Multilingual Linguistic Structure Knowledge in Large Language Models [54.58989938395976]
We introduce a decomposed prompting approach for sequence labeling tasks.
We test our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages; a sketch of the idea follows below.
arXiv Detail & Related papers (2024-02-28T15:15:39Z)
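A minimal sketch of decomposed prompting for POS tagging, under the assumption that the decomposition is one prompt per word (the paper's exact template may differ); `query_llm` is a placeholder for any text-generation call.

```python
from typing import Callable, List

UPOS_TAGS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN",
             "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM",
             "VERB", "X"]

def decomposed_pos_tagging(sentence: str,
                           query_llm: Callable[[str], str]) -> List[str]:
    """Label one word per prompt instead of asking for the whole
    tag sequence in a single prompt."""
    tags = []
    for word in sentence.split():
        prompt = (
            f'Sentence: "{sentence}"\n'
            f'Question: Which Universal Dependencies POS tag does "{word}" '
            f"have in this sentence? Answer with one of {UPOS_TAGS}.\n"
            "Answer:"
        )
        tags.append(query_llm(prompt).strip())
    return tags
```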
- Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models [27.936545041302377]
Large language models (LLMs) can generate synthetic data for humor detection via editing texts.
We benchmark LLMs on an existing human dataset and show that current LLMs display an impressive ability to 'unfun' jokes (a sketch of this step follows below).
We extend our approach to a code-mixed English-Hindi humor dataset, where we find that GPT-4's synthetic data is highly rated by bilingual annotators.
arXiv Detail & Related papers (2024-02-23T02:58:12Z)
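One way to realize the 'unfunning' step is a single editing prompt per joke, with each edited joke serving as a matched negative example; the prompt wording and `query_llm` helper are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Dict, List

UNFUN_PROMPT = (
    "Minimally edit the following joke so that it is no longer funny. "
    "Change as few words as possible and keep the topic the same.\n\n"
    "Joke: {joke}\nEdited (unfunny) version:"
)

def build_humor_pairs(jokes: List[str],
                      query_llm: Callable[[str], str]) -> List[Dict]:
    """Create matched funny/unfunny pairs for training a humor detector."""
    pairs = []
    for joke in jokes:
        unfunned = query_llm(UNFUN_PROMPT.format(joke=joke)).strip()
        pairs.append({"text": joke, "label": 1})      # original joke
        pairs.append({"text": unfunned, "label": 0})  # "unfunned" negative
    return pairs
```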
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model, as sketched below.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
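A rough sketch of the caption-only conversation generation described above: a text-only LLM scripts a dialogue in which images appear as placeholder tags that are later re-bound to the real images. The tag format and prompt are assumptions, not TextBind's actual templates.

```python
from typing import Callable, List

def captions_to_conversation(captions: List[str],
                             query_llm: Callable[[str], str]) -> str:
    """Ask a text-only LLM to script a user/assistant dialogue that
    interleaves images, each referenced by a placeholder like <img1>."""
    listed = "\n".join(f"<img{i + 1}>: {cap}"
                       for i, cap in enumerate(captions))
    prompt = (
        "Below are images, given only by their captions. Write a natural "
        "multi-turn conversation between a user and an assistant that "
        "interleaves these images, referring to them by their <imgN> tags:\n"
        f"{listed}\n\nConversation:"
    )
    return query_llm(prompt)
```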
- M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis [38.85861825252267]
M2-CTTS aims to comprehensively utilize historical conversation and enhance prosodic expression.
We design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling.
arXiv Detail & Related papers (2023-05-03T16:59:38Z)
- Language Is Not All You Need: Aligning Perception with Language Models [110.51362453720458]
We introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions.
We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data.
Experimental results show that Kosmos-1 achieves impressive performance on language understanding, generation, and even OCR-free NLP.
We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal tasks and from multimodal tasks to language.
arXiv Detail & Related papers (2023-02-27T18:55:27Z)
- Towards Multimodal Prediction of Spontaneous Humour: A Novel Dataset and First Results [84.37263300062597]
Humor is a substantial element of human social behavior, affect, and cognition.
Current methods of humor detection have been exclusively based on staged data, making them inadequate for "real-world" applications.
We contribute to addressing this deficiency by introducing the novel Passau-Spontaneous Football Coach Humor dataset, comprising about 11 hours of recordings.
arXiv Detail & Related papers (2022-09-28T17:36:47Z)
- Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection [76.62550719834722]
We deal with multimodal sarcasm and humor detection from conversational videos and image-text pairs.
We propose a novel multimodal learning system, MuLOT, which utilizes self-attention to exploit intra-modal correspondence and optimal transport for cross-modal correspondence (a sketch of the transport step follows below).
We test our approach for multimodal sarcasm and humor detection on three benchmark datasets.
arXiv Detail & Related papers (2021-10-21T07:51:56Z)
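The summary only names optimal transport, so as an illustration of the cross-modal alignment such a system might compute, here is a textbook entropy-regularized (Sinkhorn) transport plan between two sets of embeddings; this is standard Sinkhorn, not MuLOT's actual implementation.

```python
import numpy as np

def sinkhorn_plan(cost: np.ndarray, eps: float = 0.1,
                  n_iters: int = 200) -> np.ndarray:
    """Entropy-regularized optimal transport between uniform marginals,
    e.g. over text-token and video-frame embeddings."""
    n, m = cost.shape
    K = np.exp(-cost / eps)                      # Gibbs kernel
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                     # alternating scalings
        u = a / (K @ v)
        v = b / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)           # the transport plan

# Cosine-distance cost between (randomly generated, stand-in) embeddings:
text, video = np.random.rand(5, 64), np.random.rand(8, 64)
tn = text / np.linalg.norm(text, axis=1, keepdims=True)
vn = video / np.linalg.norm(video, axis=1, keepdims=True)
plan = sinkhorn_plan(1.0 - tn @ vn.T)            # rows ~ text, cols ~ video
```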
- M2H2: A Multimodal Multiparty Hindi Dataset For Humor Recognition in Conversations [72.81164101048181]
We propose a dataset for Multimodal Multiparty Hindi Humor (M2H2) recognition in conversations, containing 6,191 utterances from 13 episodes of the very popular TV series "Shrimaan Shrimati Phir Se".
Each utterance is annotated with humor/non-humor labels and encompasses acoustic, visual, and textual modalities.
The empirical results on M2H2 dataset demonstrate that multimodal information complements unimodal information for humor recognition.
arXiv Detail & Related papers (2021-08-03T02:54:09Z)
- DeHumor: Visual Analytics for Decomposing Humor [36.300283476950796]
We develop DeHumor, a visual system for analyzing humorous behaviors in public speaking.
To intuitively reveal the building blocks of each concrete example, DeHumor decomposes each humorous video into multimodal features.
We show that DeHumor is able to highlight various building blocks of humor examples.
arXiv Detail & Related papers (2021-07-18T04:01:07Z)
- Enabling Language Models to Fill in the Blanks [81.59381915581892]
We present a simple approach for text infilling, the task of predicting missing spans of text at any position in a document.
We train (or fine-tune) off-the-shelf language models on sequences containing the concatenation of artificially-masked text and the text which was masked.
We show that this approach, which we call infilling by language modeling, can enable LMs to infill entire sentences effectively on three different domains: short stories, scientific abstracts, and lyrics; a sketch of the sequence construction follows below.
arXiv Detail & Related papers (2020-05-11T18:00:03Z)
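The training-sequence construction can be sketched directly from the description above. The special-token names are assumptions modeled on the paper's examples, and masking single tokens is a simplification (the paper masks variable-length spans).

```python
import random
from typing import List

def make_ilm_example(words: List[str], n_blanks: int = 1,
                     rng: random.Random = random.Random(0)) -> str:
    """Build one infilling-by-language-modeling training string:
    the artificially masked text, a separator, then the masked spans."""
    positions = set(rng.sample(range(len(words)), n_blanks))
    masked, answers = [], []
    for i, w in enumerate(words):
        if i in positions:
            masked.append("[blank]")
            answers.append(f"{w} [answer]")
        else:
            masked.append(w)
    return " ".join(masked) + " [sep] " + " ".join(answers)

print(make_ilm_example("She ate leftover pasta for lunch .".split(),
                       n_blanks=2))
# e.g. -> "She ate [blank] pasta for [blank] . [sep] leftover [answer] lunch [answer]"
```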