MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations
- URL: http://arxiv.org/abs/2510.24178v1
- Date: Tue, 28 Oct 2025 08:33:45 GMT
- Title: MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations
- Authors: Aaron Scott, Maike Züfle, Jan Niehues
- Abstract summary: Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. We present MuSaG, the first German multimodal sarcasm detection dataset. It consists of 33 minutes of manually selected and human-annotated statements from German television shows.
- Score: 15.95945265244193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in current multimodal models and motivates the use of MuSaG for developing models better suited to realistic scenarios. We release MuSaG publicly to support future research on multimodal sarcasm detection and human-model alignment.
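The abstract does not specify the dataset's release format or evaluation code; the sketch below is a minimal, hypothetical illustration of how an aligned MuSaG-style instance (text, audio, video, per-modality human judgements) and a unimodal-versus-human comparison could be represented. All field names, the `MusagInstance` class, and the `predict` callable are assumptions for illustration, not the authors' API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MusagInstance:
    """Hypothetical MuSaG-style example: one statement with aligned modalities."""
    text: str              # transcript of the statement
    audio_path: str        # path to the aligned audio clip
    video_path: str        # path to the aligned video clip
    label: int             # gold label: 1 = sarcastic, 0 = literal
    human_text_vote: int   # human judgement made from text only
    human_audio_vote: int  # human judgement made from audio only
    human_video_vote: int  # human judgement made from video only

def accuracy(predict: Callable[[MusagInstance], int],
             data: List[MusagInstance]) -> float:
    """Fraction of instances where the prediction matches the gold label."""
    if not data:
        return 0.0
    return sum(int(predict(ex) == ex.label) for ex in data) / len(data)

def report(data: List[MusagInstance],
           model: Callable[[MusagInstance], int]) -> Dict[str, float]:
    """Compare a (text-only) model against per-modality human annotations."""
    return {
        "model_text_only": accuracy(model, data),
        "human_text": accuracy(lambda ex: ex.human_text_vote, data),
        "human_audio": accuracy(lambda ex: ex.human_audio_vote, data),
        "human_video": accuracy(lambda ex: ex.human_video_vote, data),
    }
```

Under these assumptions, such a comparison would make the gap described in the abstract directly measurable: models scoring highest on the text column while human accuracy is driven largely by the audio column.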
Related papers
- Evaluating Multimodal Large Language Models on Spoken Sarcasm Understanding [19.632399543819382]
Sarcasm detection remains a challenge in natural language understanding. We systematically evaluate large language models (LLMs) and multimodal LLMs for sarcasm detection on English and Chinese.
arXiv Detail & Related papers (2025-09-18T22:44:27Z)
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach on the low-rank adaptation (LoRA) of the large language model (LLM) backbone. Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
- VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features [13.922091192207718]
Sarcasm recognition aims to identify hidden sarcastic, criticizing, and metaphorical information embedded in everyday dialogue.
We propose a novel approach that combines a lightweight depth attention module with a self-regulated ConvNet to concentrate on the most crucial features of visual data.
We have also conducted a cross-dataset analysis to test the adaptability of VyAnG-Net on unseen samples from another dataset, MUStARD++.
arXiv Detail & Related papers (2024-08-05T15:36:52Z)
- Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue [63.32199372362483]
We propose a novel sEntiment-enhanceD Graph-based multimodal sarcasm Explanation framework, named EDGE. In particular, we first propose a lexicon-guided utterance sentiment inference module, where an utterance sentiment refinement strategy is devised. We then develop a module named Joint Cross Attention-based Sentiment Inference (JCA-SI) by extending the multimodal sentiment analysis model JCA to derive the joint sentiment label for each video-audio clip.
arXiv Detail & Related papers (2024-02-06T03:14:46Z)
- Seamless: Multilingual Expressive and Streaming Speech Translation [71.12826355107889]
We introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion.
First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model, SeamlessM4T v2.
We bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time.
arXiv Detail & Related papers (2023-12-08T17:18:42Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models [5.668457303716451]
We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks.
Our approach significantly reduces model complexity, adds interpretability to the model's decision, and can be applied for a diverse set of tasks.
arXiv Detail & Related papers (2023-03-27T17:54:32Z)
- WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models [57.557319372969495]
Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks.
Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings.
We propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model.
arXiv Detail & Related papers (2022-03-29T19:08:55Z)
- When did you become so smart, oh wise one?! Sarcasm Explanation in Multi-modal Multi-party Dialogues [27.884015521888458]
We study the discourse structure of sarcastic conversations and propose a novel task, Sarcasm Explanation in Dialogue (SED).
SED aims to generate natural language explanations of satirical conversations.
We propose MAF, a multimodal context-aware attention and global information fusion module to capture multimodality and use it to benchmark WITS.
arXiv Detail & Related papers (2022-03-12T12:16:07Z)
- Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z)