MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection
- URL: http://arxiv.org/abs/2510.23299v1
- Date: Mon, 27 Oct 2025 13:05:27 GMT
- Title: MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection
- Authors: Haochen Zhao, Yuyao Kong, Yongxiu Xu, Gaopeng Gou, Hongbo Xu, Yubin Wang, Haoliang Zhang,
- Abstract summary: We introduce MMSD3.0, a new benchmark composed entirely of multi-image samples curated from tweets and Amazon reviews. We propose the Cross-Image Reasoning Model (CIRM), which performs targeted cross-image sequence modeling to capture latent inter-image connections. In addition, we introduce a relevance-guided, fine-grained cross-modal fusion mechanism based on text-image correspondence to reduce information loss during integration.
- Score: 12.041688144153532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite progress in multimodal sarcasm detection, existing datasets and methods predominantly focus on single-image scenarios, overlooking potential semantic and affective relations across multiple images. This leaves a gap in modeling cases where sarcasm is triggered by multi-image cues in real-world settings. To bridge this gap, we introduce MMSD3.0, a new benchmark composed entirely of multi-image samples curated from tweets and Amazon reviews. We further propose the Cross-Image Reasoning Model (CIRM), which performs targeted cross-image sequence modeling to capture latent inter-image connections. In addition, we introduce a relevance-guided, fine-grained cross-modal fusion mechanism based on text-image correspondence to reduce information loss during integration. We establish a comprehensive suite of strong and representative baselines and conduct extensive experiments, showing that MMSD3.0 is an effective and reliable benchmark that better reflects real-world conditions. Moreover, CIRM demonstrates state-of-the-art performance across MMSD, MMSD2.0 and MMSD3.0, validating its effectiveness in both single-image and multi-image scenarios.
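The abstract names two concrete mechanisms: cross-image sequence modeling in CIRM, and a relevance-guided, fine-grained cross-modal fusion based on text-image correspondence. The paper's implementation is not shown here; the snippet below is a minimal, hypothetical sketch of how those two pieces could fit together, assuming CLIP-style encoders that yield one pooled embedding per text and per image. All module names, dimensions, and the softmax-based relevance weighting are illustrative assumptions, not the authors' code.

```python
# Hypothetical CIRM-style sketch (not the authors' implementation).
# Assumes precomputed CLIP-style embeddings: one per text, one per image.
import torch
import torch.nn as nn

class CrossImageSarcasmSketch(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        # Cross-image sequence modeling: a small transformer over the
        # per-image embeddings to capture latent inter-image relations.
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.cross_image = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(2 * dim, 2)  # sarcastic vs. literal

    def forward(self, text_emb: torch.Tensor, image_embs: torch.Tensor):
        # text_emb:   (batch, dim)         pooled text embedding
        # image_embs: (batch, n_imgs, dim) one embedding per image
        img_seq = self.cross_image(image_embs)  # inter-image reasoning

        # Relevance-guided fusion: weight each image by its scaled
        # similarity to the text, so weakly related images contribute less.
        scores = torch.einsum("bd,bnd->bn", text_emb, img_seq)
        weights = torch.softmax(scores / img_seq.shape[-1] ** 0.5, dim=-1)
        visual = torch.einsum("bn,bnd->bd", weights, img_seq)

        return self.classifier(torch.cat([text_emb, visual], dim=-1))
```

In this toy reading, the transformer over per-image embeddings stands in for the cross-image sequence modeling, and the text-conditioned softmax weights stand in for relevance-guided fusion; the real CIRM is described only at the level of the abstract above.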
Related papers
- MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models [89.89575486159795]
We introduce MICON-Bench, a benchmark for multi-image context generation. We propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency. We also present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to enhance coherence and reduce hallucinations (a minimal sketch of the rebalancing idea appears after this list).
arXiv Detail & Related papers (2026-02-23T04:32:52Z) - CoLLM: A Large Language Model for Composed Image Retrieval [76.29725148964368]
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. We present CoLLM, a one-stop framework that generates triplets on-the-fly from image-caption pairs. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts.
arXiv Detail & Related papers (2025-03-25T17:59:50Z) - Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models [79.59567114769513]
We introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. Our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 24.94% and even surpassing much larger 70B models.
arXiv Detail & Related papers (2025-01-10T07:56:23Z) - SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model [48.547599530927926]
Synthetic images, when shared on social media, can mislead extensive audiences and erode trust in digital content. We introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages. We propose a new image deepfake detection, localization, and explanation framework, named SIDA.
arXiv Detail & Related papers (2024-12-05T16:12:25Z) - Multi-View Large Reconstruction Model via Geometry-Aware Positional Encoding and Attention [54.66152436050373]
We propose a Multi-view Large Reconstruction Model (M-LRM) to reconstruct high-quality 3D shapes from multiple views in a 3D-aware manner. Specifically, we introduce a multi-view consistent cross-attention scheme that enables M-LRM to accurately query information from the input images. Compared to previous methods, the proposed M-LRM can generate 3D shapes of high fidelity.
arXiv Detail & Related papers (2024-06-11T18:29:13Z) - Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images.
We present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images.
We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
arXiv Detail & Related papers (2024-04-03T23:20:40Z) - VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias [17.107961913114778]
Multimodal misinformation is a growing problem on social media platforms.
In this study, we investigate and identify the presence of unimodal bias in widely-used MMD benchmarks.
We introduce a new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating realistic synthetic training data.
arXiv Detail & Related papers (2023-04-27T12:28:29Z) - Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift [50.64474103506595]
We investigate the robustness of 12 popular open-sourced image-text models under common perturbations on five tasks.
Character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data.
arXiv Detail & Related papers (2022-12-15T18:52:03Z) - A Novel Self-Supervised Cross-Modal Image Retrieval Method In Remote Sensing [0.0]
Cross-modal RS image retrieval methods search semantically similar images across different modalities.
Existing CM-RSIR methods require annotated training images and do not concurrently address intra- and inter-modal similarity preservation and inter-modal discrepancy elimination.
We introduce a novel self-supervised cross-modal image retrieval method that aims to model mutual-information between different modalities in a self-supervised manner.
arXiv Detail & Related papers (2022-02-23T11:20:24Z) - MEG: Multi-Evidence GNN for Multimodal Semantic Forensics [28.12652559292884]
Fake news often involves semantic manipulations across modalities such as image, text, and location.
Recent research has centered the problem around images, calling it image repurposing.
We introduce a novel graph neural network based model for multimodal semantic forensics.
arXiv Detail & Related papers (2020-11-23T09:01:28Z)
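One technique in the list above is concrete enough to illustrate briefly: the Dynamic Attention Rebalancing (DAR) entry describes a training-free, plug-and-play adjustment of attention at inference time. The sketch below shows one generic way such a rebalancing could work, boosting pre-softmax attention scores on image tokens and renormalizing; the fixed gain, the token grouping, and the function name are assumptions for illustration, not the MICON-Bench authors' algorithm.

```python
# Illustrative, training-free attention rebalancing in the spirit of DAR.
# ASSUMPTION: a fixed multiplicative gain on image-token attention; the
# published method adjusts attention dynamically and is not reproduced here.
import torch

def rebalance_attention(attn_logits: torch.Tensor,
                        image_token_mask: torch.Tensor,
                        gain: float = 1.5) -> torch.Tensor:
    """Boost pre-softmax attention scores on image tokens, then renormalize.

    attn_logits:      (..., seq_len) raw attention scores over key tokens
    image_token_mask: (seq_len,) bool, True where the key is an image token
    """
    # Adding log(gain) to a logit multiplies its softmax weight by `gain`
    # before renormalization, leaving non-image tokens untouched.
    boost = torch.where(image_token_mask,
                        torch.log(torch.tensor(gain)),
                        torch.tensor(0.0))
    return torch.softmax(attn_logits + boost, dim=-1)
```

Because the adjustment touches only the attention scores, it can in principle be applied through forward hooks on an existing model without retraining, which is what makes such mechanisms plug-and-play.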
This list is automatically generated from the titles and abstracts of the papers on this site.