Evaluating Multimodal Large Language Models on Spoken Sarcasm Understanding
- URL: http://arxiv.org/abs/2509.15476v1
- Date: Thu, 18 Sep 2025 22:44:27 GMT
- Title: Evaluating Multimodal Large Language Models on Spoken Sarcasm Understanding
- Authors: Zhu Li, Xiyuan Gao, Yuqing Zhang, Shekhar Nayak, Matt Coler
- Abstract summary: Sarcasm detection remains a challenge in natural language understanding. We systematically evaluate large language models (LLMs) and multimodal LLMs for sarcasm detection on English and Chinese.
- Score: 19.632399543819382
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sarcasm detection remains a challenge in natural language understanding, as sarcastic intent often relies on subtle cross-modal cues spanning text, speech, and vision. While prior work has primarily focused on textual or visual-textual sarcasm, comprehensive audio-visual-textual sarcasm understanding remains underexplored. In this paper, we systematically evaluate large language models (LLMs) and multimodal LLMs for sarcasm detection on English (MUStARD++) and Chinese (MCSD 1.0) in zero-shot, few-shot, and LoRA fine-tuning settings. In addition to direct classification, we explore models as feature encoders, integrating their representations through a collaborative gating fusion module. Experimental results show that audio-based models achieve the strongest unimodal performance, while text-audio and audio-vision combinations outperform unimodal and trimodal models. Furthermore, MLLMs such as Qwen-Omni show competitive zero-shot and fine-tuned performance. Our findings highlight the potential of MLLMs for cross-lingual, audio-visual-textual sarcasm understanding.
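The abstract names a collaborative gating fusion module for combining the encoders' representations but gives no implementation details here; the snippet below is only a minimal PyTorch sketch of one plausible gated trimodal fusion, with the hidden size, gating form, and all feature dimensions assumed rather than taken from the paper.

```python
import torch
import torch.nn as nn

class GatedTrimodalFusion(nn.Module):
    """Fuse text/audio/vision features with learned per-modality gate weights."""

    def __init__(self, dim_text, dim_audio, dim_vision, dim_hidden=256, n_classes=2):
        super().__init__()
        self.proj_text = nn.Linear(dim_text, dim_hidden)
        self.proj_audio = nn.Linear(dim_audio, dim_hidden)
        self.proj_vision = nn.Linear(dim_vision, dim_hidden)
        # Gate is conditioned on all modalities jointly and yields one weight each.
        self.gate = nn.Linear(3 * dim_hidden, 3)
        self.classifier = nn.Linear(dim_hidden, n_classes)  # sarcastic vs. not

    def forward(self, text, audio, vision):
        h = torch.stack([torch.tanh(self.proj_text(text)),
                         torch.tanh(self.proj_audio(audio)),
                         torch.tanh(self.proj_vision(vision))], dim=1)  # (B, 3, H)
        weights = torch.softmax(self.gate(h.flatten(1)), dim=-1)        # (B, 3)
        fused = (weights.unsqueeze(-1) * h).sum(dim=1)                  # (B, H)
        return self.classifier(fused)                                   # (B, n_classes)

# Example with placeholder feature sizes for frozen text/audio/vision encoders.
logits = GatedTrimodalFusion(768, 1024, 512)(
    torch.randn(4, 768), torch.randn(4, 1024), torch.randn(4, 512))
```

Conditioning the gate on all modalities jointly is one common way to let the network down-weight an uninformative modality per sample; the paper's actual gating may differ.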
Related papers
- MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations [15.95945265244193]
Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one.
We present MuSaG, the first German multimodal sarcasm detection dataset.
It consists of 33 minutes of manually selected and human-annotated statements from German television shows.
arXiv Detail & Related papers (2025-10-28T08:33:45Z)
- Can Large Vision-Language Models Understand Multimodal Sarcasm? [14.863320201956963]
Sarcasm is a complex linguistic phenomenon that involves a disparity between literal and intended meanings.
We evaluate Large Visual Language Models (LVLMs) in Multimodal Sarcasm Analysis (MSA) tasks.
We propose a training-free framework that integrates in-depth object extraction and external conceptual knowledge.
arXiv Detail & Related papers (2025-08-05T17:05:11Z)
- Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models [10.47267683821842]
We propose an innovative multi-modal Commander-GPT framework for sarcasm detection.
Inspired by military strategy, we first decompose the sarcasm detection task into six distinct sub-tasks.
A central commander (decision-maker) then assigns the best-suited large language model to address each specific sub-task.
Our approach achieves state-of-the-art performance, with a 19.3% improvement in F1 score.
arXiv Detail & Related papers (2025-03-24T13:53:00Z)
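The Commander-GPT entry above describes decomposing sarcasm detection into sub-tasks and letting a central commander assign a model to each. As a rough illustration only, the sketch below wires placeholder specialists to a simple majority-vote commander; the six actual sub-tasks, the model assignments, and the aggregation rule are not specified here and are assumed.

```python
from typing import Callable, Dict

# A specialist handles one sub-task and returns True if it signals sarcasm.
Specialist = Callable[[dict], bool]

def commander(routes: Dict[str, Specialist], sample: dict) -> str:
    """Dispatch the sample to every sub-task specialist, then aggregate by majority vote."""
    verdicts = {name: fn(sample) for name, fn in routes.items()}
    return "sarcastic" if sum(verdicts.values()) > len(verdicts) / 2 else "not sarcastic"

# Placeholder specialists; in the paper each sub-task would instead be assigned
# to the large language model judged best-suited for it.
routes = {
    "incongruity_check": lambda s: "yeah right" in s.get("text", "").lower(),
    "sentiment_contrast": lambda s: "perfectly" in s.get("text", "").lower(),
    "context_reasoning": lambda s: False,
}
print(commander(routes, {"text": "Yeah right, that went perfectly."}))  # -> sarcastic
```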
- RCLMuFN: Relational Context Learning and Multiplex Fusion Network for Multimodal Sarcasm Detection [1.023096557577223]
We propose a relational context learning and multiplex fusion network (RCLMuFN) for multimodal sarcasm detection.
Firstly, we employ four feature extractors to comprehensively extract features from raw text and images.
Secondly, we utilize the relational context learning module to learn the contextual information of text and images.
arXiv Detail & Related papers (2024-12-17T15:29:31Z)
- Self-Powered LLM Modality Expansion for Large Speech-Text Models [62.27700381806554]
Large language models (LLMs) exhibit remarkable performance across diverse tasks.
This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning.
We introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning.
arXiv Detail & Related papers (2024-10-04T04:34:24Z)
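The self-powered LSM entry above hinges on the model generating its own ASR-style training data; the function below is a purely hypothetical sketch of one such round, where `model.generate` and the prompt are placeholder names rather than the paper's actual interface.

```python
# Hypothetical sketch: the current speech-text model transcribes unlabeled audio
# and its own outputs become new instruction-tuning pairs for the next round.
def self_powered_round(model, unlabeled_audio, instruction="Transcribe the speech."):
    new_pairs = []
    for audio in unlabeled_audio:
        hypothesis = model.generate(audio=audio, prompt=instruction)  # placeholder API
        new_pairs.append({"audio": audio,
                          "instruction": instruction,
                          "response": hypothesis})
    return new_pairs  # mixed back into the instruction-tuning set
```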
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.
Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts.
We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
- VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features [13.922091192207718]
Sarcasm recognition aims to identify hidden sarcastic, criticizing, and metaphorical information embedded in everyday dialogue.
We propose a novel approach that combines a lightweight depth attention module with a self-regulated ConvNet to concentrate on the most crucial features of visual data.
We have also conducted a cross-dataset analysis to test the adaptability of VyAnG-Net with unseen samples of another dataset, MUStARD++.
arXiv Detail & Related papers (2024-08-05T15:36:52Z)
- Language Is Not All You Need: Aligning Perception with Language Models [110.51362453720458]
We introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions.
We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data.
Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP.
We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language.
arXiv Detail & Related papers (2023-02-27T18:55:27Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
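The sub-word speech LM entry above describes an LSTM generative model over syllable/phoneme units; the snippet below is a generic sketch of that model class, where the layer sizes, depth, and discrete-unit vocabulary are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SubwordUnitLM(nn.Module):
    """Minimal LSTM language model over discrete syllable/phoneme unit IDs."""

    def __init__(self, vocab_size, dim_embed=128, dim_hidden=256, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim_embed)
        self.lstm = nn.LSTM(dim_embed, dim_hidden, num_layers=n_layers, batch_first=True)
        self.out = nn.Linear(dim_hidden, vocab_size)

    def forward(self, unit_ids):
        h, _ = self.lstm(self.embed(unit_ids))   # (B, T, H)
        return self.out(h)                       # next-unit logits, (B, T, V)

# Example: next-unit prediction loss on a random batch of unit sequences.
model = SubwordUnitLM(vocab_size=100)
units = torch.randint(0, 100, (2, 16))
logits = model(units[:, :-1])
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 100), units[:, 1:].reshape(-1))
```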
- Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection [76.62550719834722]
We deal with multimodal sarcasm and humor detection from conversational videos and image-text pairs.
We propose a novel multimodal learning system, MuLOT, which utilizes self-attention to exploit intra-modal correspondence.
We test our approach for multimodal sarcasm and humor detection on three benchmark datasets.
arXiv Detail & Related papers (2021-10-21T07:51:56Z)
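The MuLOT entry above pairs self-attention within each modality with optimal transport (per its title) across modalities. The sketch below shows only a generic entropic-OT (Sinkhorn) coupling used to soft-align features from two modalities, with uniform marginals and placeholder dimensions; it is not MuLOT's exact formulation.

```python
import torch

def sinkhorn_plan(cost, eps=0.1, n_iters=50):
    """Entropic-OT coupling between two uniform distributions (Sinkhorn iterations)."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)
    a, b = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)  # (n, m) transport plan

# Example: soft-align audio frames to text tokens through the transport plan.
text_feats = torch.randn(12, 64)    # 12 text tokens (placeholder features)
audio_feats = torch.randn(20, 64)   # 20 audio frames
cost = torch.cdist(text_feats, audio_feats)
plan = sinkhorn_plan(cost / cost.max())  # normalize so exp(-cost/eps) stays well-scaled
text_aligned_audio = (plan / plan.sum(dim=1, keepdim=True)) @ audio_feats  # (12, 64)
```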
- Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.