Evaluating Open-Source Vision-Language Models for Multimodal Sarcasm Detection
- URL: http://arxiv.org/abs/2510.11852v1
- Date: Mon, 13 Oct 2025 19:05:21 GMT
- Title: Evaluating Open-Source Vision-Language Models for Multimodal Sarcasm Detection
- Authors: Saroj Basnet, Shafkat Farabi, Tharindu Ranasinghe, Diptesh Kanojia, Marcos Zampieri
- Abstract summary: We evaluate seven state-of-the-art vision-language models (VLMs) on their ability to detect multimodal sarcasm. We also evaluate the models' capabilities in generating explanations for sarcastic instances.
- Score: 18.11319620244252
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advances in open-source vision-language models (VLMs) offer new opportunities for understanding complex and subjective multimodal phenomena such as sarcasm. In this work, we evaluate seven state-of-the-art VLMs - BLIP2, InstructBLIP, OpenFlamingo, LLaVA, PaliGemma, Gemma3, and Qwen-VL - on their ability to detect multimodal sarcasm using zero-, one-, and few-shot prompting. Furthermore, we evaluate the models' capabilities in generating explanations for sarcastic instances. We evaluate the capabilities of VLMs on three benchmark sarcasm datasets (Muse, MMSD2.0, and SarcNet). Our primary objectives are twofold: (1) to quantify each model's performance in detecting sarcastic image-caption pairs, and (2) to assess their ability to generate human-quality explanations that highlight the visual-textual incongruities driving sarcasm. Our results indicate that, while current models achieve moderate success in binary sarcasm detection, they are still not able to generate high-quality explanations without task-specific finetuning.
Related papers
- Evaluating Multimodal Large Language Models on Spoken Sarcasm Understanding [19.632399543819382]
Sarcasm detection remains a challenge in natural language understanding. We systematically evaluate large language models (LLMs) and multimodal LLMs for sarcasm detection in English and Chinese.
arXiv Detail & Related papers (2025-09-18T22:44:27Z) - SkillVerse : Assessing and Enhancing LLMs with Tree Evaluation [70.27631454256024]
SkillVerse is an unsupervised tree-structured diagnosis framework for understanding model proficiency in specific abilities. Given proficiency at arbitrary levels of granularity, SkillVerse can flexibly produce insights into the behaviors of modern large models.
arXiv Detail & Related papers (2025-05-31T00:08:59Z) - Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models [10.47267683821842]
We propose an innovative multi-modal Commander-GPT framework for sarcasm detection. Inspired by military strategy, we first decompose the sarcasm detection task into six distinct sub-tasks. A central commander (decision-maker) then assigns the best-suited large language model to address each specific sub-task. Our approach achieves state-of-the-art performance, with a 19.3% improvement in F1 score.
arXiv Detail & Related papers (2025-03-24T13:53:00Z) - Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models [25.416060651721764]
We introduce an analytical framework using systematically designed prompts on existing multimodal sarcasm datasets. We examine interpretive variations within and across models, focusing on confidence levels, alignment with dataset labels, and recognition of ambiguous "neutral" cases. Our findings reveal notable discrepancies, both across LVLMs and within the same model under varied prompts.
arXiv Detail & Related papers (2025-03-15T14:10:25Z) - VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - CofiPara: A Coarse-to-fine Paradigm for Multimodal Sarcasm Target Identification with Large Multimodal Models [14.453131020178564]
This paper proposes a versatile MSTI framework with a coarse-to-fine paradigm, by augmenting sarcasm explainability with reasoning and pre-training knowledge.
Inspired by the powerful capacity of Large Multimodal Models (LMMs) on multimodal reasoning, we first engage LMMs to generate competing rationales for coarser-grained pre-training of a small language model on multimodal sarcasm detection.
We then propose fine-tuning the model for finer-grained sarcasm target identification. Our framework is thus empowered to adeptly unveil the intricate targets within multimodal sarcasm and mitigate the negative impact posed by potential noise inherently in LMMs.
arXiv Detail & Related papers (2024-05-01T08:44:44Z) - MMSD2.0: Towards a Reliable Multi-modal Sarcasm Detection System [57.650338588086186]
We introduce MMSD2.0, a corrected dataset that fixes the shortcomings of MMSD.
We present a novel framework called multi-view CLIP that is capable of leveraging multi-grained cues from multiple perspectives.
arXiv Detail & Related papers (2023-07-14T03:22:51Z) - How to Describe Images in a More Funny Way? Towards a Modular Approach to Cross-Modal Sarcasm Generation [62.89586083449108]
We study a new problem of cross-modal sarcasm generation (CMSG), i.e., generating a sarcastic description for a given image.
CMSG is challenging as models need to satisfy the characteristics of sarcasm, as well as the correlation between different modalities.
We propose an Extraction-Generation-Ranking based Modular method (EGRM) for cross-modal sarcasm generation.
arXiv Detail & Related papers (2022-11-20T14:38:24Z) - Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection [76.62550719834722]
We deal with multimodal sarcasm and humor detection from conversational videos and image-text pairs.
We propose a novel multimodal learning system, MuLOT, which utilizes self-attention to exploit intra-modal correspondence.
We test our approach for multimodal sarcasm and humor detection on three benchmark datasets.
arXiv Detail & Related papers (2021-10-21T07:51:56Z) - Multi-Modal Sarcasm Detection Based on Contrastive Attention Mechanism [7.194040730138362]
We construct a Contrastive-Attention-based Sarcasm Detection (ConAttSD) model, which uses an inter-modality contrastive attention mechanism to extract contrastive features for an utterance.
Our experiments on MUStARD, a benchmark multi-modal sarcasm dataset, demonstrate the effectiveness of the proposed ConAttSD model.
arXiv Detail & Related papers (2021-09-30T14:17:51Z) - $R^3$: Reverse, Retrieve, and Rank for Sarcasm Generation with Commonsense Knowledge [51.70688120849654]
We propose an unsupervised approach for sarcasm generation based on a non-sarcastic input sentence.
Our method employs a retrieve-and-edit framework to instantiate two major characteristics of sarcasm.
arXiv Detail & Related papers (2020-04-28T02:30:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.