Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models
- URL: http://arxiv.org/abs/2505.15489v2
- Date: Mon, 26 May 2025 17:51:52 GMT
- Title: Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models
- Authors: Jiaying Wu, Fanxiao Li, Min-Yen Kan, Bryan Hooi
- Abstract summary: We introduce an automated framework that simulates real-world multimodal news creation by explicitly modeling creator intent. DeceptionDecoded is a benchmark comprising 12,000 image-caption pairs aligned with trustworthy reference articles. We conduct a comprehensive evaluation of 14 state-of-the-art vision-language models (VLMs) on three intent-centric tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The real-world impact of misinformation stems from the underlying misleading narratives that creators seek to convey. As such, interpreting misleading creator intent is essential for multimodal misinformation detection (MMD) systems aimed at effective information governance. In this paper, we introduce an automated framework that simulates real-world multimodal news creation by explicitly modeling creator intent through two components: the desired influence and the execution plan. Using this framework, we construct DeceptionDecoded, a large-scale benchmark comprising 12,000 image-caption pairs aligned with trustworthy reference articles. The dataset captures both misleading and non-misleading intents and spans manipulations across visual and textual modalities. We conduct a comprehensive evaluation of 14 state-of-the-art vision-language models (VLMs) on three intent-centric tasks: (1) misleading intent detection, (2) misleading source attribution, and (3) creator desire inference. Despite recent advances, we observe that current VLMs fall short in recognizing misleading intent, often relying on spurious cues such as superficial cross-modal consistency, stylistic signals, and heuristic authenticity hints. Our findings highlight the pressing need for intent-aware modeling in MMD and open new directions for developing systems capable of deeper reasoning about multimodal misinformation.
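The three intent-centric tasks can be pictured as a small evaluation harness. The sketch below is purely illustrative: the task names follow the abstract, but the prompts, label sets, and the `evaluate` helper are hypothetical, not the paper's actual protocol.

```python
# Hypothetical sketch of the three intent-centric tasks from the abstract.
# Task names come from the paper; prompts and label sets are invented here.
TASKS = {
    "misleading_intent_detection": {
        "prompt": "Is this image-caption pair crafted with misleading intent? Answer yes or no.",
        "labels": ["yes", "no"],
    },
    "misleading_source_attribution": {
        "prompt": "Which modality carries the manipulation: image, caption, or both?",
        "labels": ["image", "caption", "both"],
    },
    "creator_desire_inference": {
        "prompt": "Describe the influence the creator desires to exert on readers.",
        "labels": None,  # open-ended generation, scored separately
    },
}

def evaluate(vlm, sample):
    """Run one image-caption sample through all three tasks.
    `vlm` is any callable mapping (image, text prompt) -> string answer."""
    results = {}
    for name, task in TASKS.items():
        answer = vlm(sample["image"], f'{task["prompt"]}\nCaption: {sample["caption"]}')
        # Closed-set tasks check the answer against the label set;
        # the open-ended task accepts any generated description.
        valid = answer.strip().lower() in task["labels"] if task["labels"] else True
        results[name] = {"answer": answer, "valid": valid}
    return results

# A stub model that always answers "yes" (stands in for a real VLM).
stub = lambda image, prompt: "yes"
out = evaluate(stub, {"image": None, "caption": "Example caption"})
print(out["misleading_intent_detection"]["valid"])  # → True
```

In this framing, the abstract's finding that VLMs rely on spurious cues would surface as high validity but poor accuracy on the closed-set tasks.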
Related papers
- Intent Representation Learning with Large Language Model for Recommendation [11.118517297006894]
We propose a model-agnostic framework, Intent Representation Learning with Large Language Model (IRLLRec), to construct multimodal intents and enhance recommendations. Specifically, IRLLRec employs a dual-tower architecture to learn multimodal intent representations. To better match textual and interaction-based intents, we employ momentum distillation to perform teacher-student learning on fused intent representations.
arXiv Detail & Related papers (2025-02-05T16:08:05Z) - Dynamic Analysis and Adaptive Discriminator for Fake News Detection [59.41431561403343]
We propose a Dynamic Analysis and Adaptive Discriminator (DAAD) approach for fake news detection. For knowledge-based methods, we introduce the Monte Carlo Tree Search algorithm to leverage the self-reflective capabilities of large language models. For semantic-based methods, we define four typical deceit patterns to reveal the mechanisms behind fake news creation.
arXiv Detail & Related papers (2024-08-20T14:13:54Z) - Detecting and Grounding Multi-Modal Media Manipulation and Beyond [93.08116982163804]
We highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM4)
DGM4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content.
We propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities.
arXiv Detail & Related papers (2023-09-25T15:05:46Z) - Inconsistent Matters: A Knowledge-guided Dual-consistency Network for Multi-modal Rumor Detection [53.48346699224921]
A novel Knowledge-guided Dual-consistency Network is proposed to detect rumors with multimedia contents.
It uses two consistency detection networks to capture inconsistency at the cross-modal level and the content-knowledge level simultaneously.
It also enables robust multi-modal representation learning under different missing visual modality conditions.
arXiv Detail & Related papers (2023-06-03T15:32:20Z) - Interpretable Detection of Out-of-Context Misinformation with Neural-Symbolic-Enhanced Large Multimodal Model [16.348950072491697]
Misinformation creators now increasingly use out-of-context multimedia contents to deceive the public and fake news detection systems.
This new type of misinformation makes both detection and clarification more difficult, because each individual modality is close enough to true information.
In this paper, we explore how to achieve interpretable cross-modal de-contextualization detection that simultaneously identifies the mismatched pairs and the cross-modal contradictions.
arXiv Detail & Related papers (2023-04-15T21:11:55Z) - Detecting and Grounding Multi-Modal Media Manipulation [32.34908534582532]
We highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM4)
DGM4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content.
We propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities.
arXiv Detail & Related papers (2023-04-05T16:20:40Z) - Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization [73.89239820192894]
We argue that automated counterfactual generation should regard several aspects of the produced adversarial instances.
We present a novel framework for the generation of counterfactual examples.
arXiv Detail & Related papers (2022-05-20T15:02:53Z) - Knowledge-enriched Attention Network with Group-wise Semantic for Visual Storytelling [39.59158974352266]
Visual storytelling aims at generating an imaginary and coherent story with narrative multi-sentences from a group of relevant images.
Existing methods often generate direct and rigid descriptions of apparent image-based contents, because they are not capable of exploring implicit information beyond images.
To address these problems, a novel knowledge-enriched attention network with group-wise semantic model is proposed.
arXiv Detail & Related papers (2022-03-10T12:55:47Z) - Perceptual Score: What Data Modalities Does Your Model Perceive? [73.75255606437808]
We introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features.
We find that recent, more accurate multi-modal models for visual question-answering tend to perceive the visual data less than their predecessors.
Using the perceptual score also helps to analyze model biases by decomposing the score into data subset contributions.
arXiv Detail & Related papers (2021-10-27T12:19:56Z)
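The idea behind a modality-reliance metric like the perceptual score can be sketched with a simple permutation ablation: shuffle one modality's features across examples and measure the accuracy drop. This is a minimal illustrative sketch, not the paper's exact definition; the function name, the toy model, and the permutation scheme are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def perceptual_score(model, visual, text, labels, n_perm=5):
    """Estimate reliance on the visual modality by comparing accuracy on
    intact inputs vs. inputs with the visual features shuffled across
    examples (a permutation-style ablation; illustrative only)."""
    base_acc = np.mean(model(visual, text) == labels)
    perm_accs = []
    for _ in range(n_perm):
        idx = rng.permutation(len(visual))
        perm_accs.append(np.mean(model(visual[idx], text) == labels))
    # A large drop under permutation means strong reliance on vision.
    return base_acc - np.mean(perm_accs)

# Toy model that only looks at the text features, ignoring vision.
toy_model = lambda v, t: (t > 0.5).astype(int)
v = rng.random(100)
t = rng.random(100)
y = (t > 0.5).astype(int)
print(perceptual_score(toy_model, v, t, y))  # → 0.0: model ignores vision
```

Under this sketch, the finding that newer VQA models "perceive the visual data less" would correspond to a visual score closer to zero.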
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.