Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations
- URL: http://arxiv.org/abs/2509.12653v1
- Date: Tue, 16 Sep 2025 04:18:48 GMT
- Title: Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations
- Authors: Jinjie Shen, Yaxiong Wang, Lechao Cheng, Nan Pu, Zhun Zhong
- Abstract summary: We pioneer the detection of semantically-coordinated manipulations in multimodal data. We propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. Our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches.
- Score: 56.816929931908824
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt cross-modal alignment, creating easily detectable anomalies. To bridge this gap, we pioneer the detection of semantically-coordinated manipulations where visual edits are systematically paired with semantically consistent textual descriptions. Our approach begins with constructing the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated through a two-stage pipeline: 1) applying state-of-the-art image manipulations, followed by 2) generation of contextually-plausible textual narratives that reinforce the visual deception. Building on this foundation, we propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. RamDG commences by harnessing external knowledge repositories to retrieve contextual evidence, which serves as auxiliary text and is encoded together with the inputs through our image forgery grounding and deep manipulation detection modules to trace all manipulations. Extensive experiments demonstrate that our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches. The dataset and code are publicly available at https://github.com/shen8424/SAMM-RamDG-CAP.
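To make the retrieval-augmented flow described above concrete, the sketch below mirrors it at a high level: retrieved contextual evidence is treated as auxiliary text, fused with image and caption features, and passed to heads that output a manipulation verdict, a grounded image region, and per-token text tags. Module names, feature shapes, and head designs are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a retrieval-augmented detection-and-grounding model;
# names, dimensions, and heads are illustrative, not the RamDG code.
import torch
import torch.nn as nn

class RamDGSketch(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.fuse = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.detect_head = nn.Linear(dim, 2)   # authentic vs. manipulated
        self.bbox_head = nn.Linear(dim, 4)     # grounded image region (x, y, w, h)
        self.token_head = nn.Linear(dim, 2)    # per-token "manipulated text" tag

    def forward(self, img_tokens, txt_tokens, evidence_tokens):
        # Retrieved evidence acts as auxiliary text: concatenate it with the caption tokens.
        context = torch.cat([txt_tokens, evidence_tokens], dim=1)
        # Image tokens attend over caption + retrieved evidence.
        fused, _ = self.fuse(img_tokens, context, context)
        pooled = fused.mean(dim=1)
        return {
            "label_logits": self.detect_head(pooled),
            "bbox": self.bbox_head(pooled).sigmoid(),
            "token_logits": self.token_head(txt_tokens),
        }

# Dummy run: batch of 2, 49 image patches, 32 caption tokens, 16 evidence tokens.
model = RamDGSketch()
out = model(torch.randn(2, 49, 256), torch.randn(2, 32, 256), torch.randn(2, 16, 256))
print(out["label_logits"].shape, out["bbox"].shape, out["token_logits"].shape)
```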
Related papers
- VizDefender: Unmasking Visualization Tampering through Proactive Localization and Intent Inference [53.31458914370742]
VizDefender is a framework for tampering detection and analysis. The framework integrates two core components: 1) a semi-fragile watermark module that protects the visualization by embedding a location map into images, and 2) an intent analysis module that leverages Multimodal Large Language Models (MLLMs) to interpret the manipulation.
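As a rough illustration of the "embed a location map, then check it" idea, the toy below uses plain least-significant-bit embedding. VizDefender's semi-fragile watermark is a different and more robust scheme, so treat this only as a conceptual sketch.

```python
# Toy illustration only: embed/recover a binary location map in the image's
# least-significant bit, then flag pixels whose recovered bit no longer matches.
import numpy as np

def embed_location_map(image: np.ndarray, location_map: np.ndarray) -> np.ndarray:
    """Write the binary map into the LSB of one channel."""
    marked = image.copy()
    marked[..., 0] = (marked[..., 0] & 0xFE) | location_map.astype(np.uint8)
    return marked

def locate_tampering(marked: np.ndarray, location_map: np.ndarray) -> np.ndarray:
    """Regions where the recovered bit disagrees with the map are flagged."""
    recovered = marked[..., 0] & 0x01
    return recovered != location_map.astype(np.uint8)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
loc_map = rng.integers(0, 2, size=(64, 64), dtype=np.uint8)
marked = embed_location_map(img, loc_map)
marked[10:20, 10:20] = 0                        # simulate a local edit
print(locate_tampering(marked, loc_map).sum())  # roughly half the edited pixels flip
```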
arXiv Detail & Related papers (2025-12-21T18:44:03Z)
- A Mutual Learning Method for Salient Object Detection with intertwined Multi-Supervision--Revised [67.61878540090116]
We propose to train saliency detection networks by exploiting supervision not only from salient object detection, but also from foreground contour detection and edge detection. First, we leverage the salient object detection and foreground contour detection tasks in an intertwined manner to generate saliency maps with uniform highlights. Second, the foreground contour and edge detection tasks guide each other simultaneously, leading to precise foreground contour prediction and reducing local noise in edge prediction.
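A minimal sketch of such intertwined multi-task supervision, assuming one shared backbone with saliency, contour, and edge heads and placeholder losses (not the paper's exact architecture or mutual-learning schedule):

```python
# Sketch: one backbone, three heads, jointly supervised; targets are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriHeadSketch(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.saliency = nn.Conv2d(ch, 1, 1)
        self.contour = nn.Conv2d(ch, 1, 1)
        self.edge = nn.Conv2d(ch, 1, 1)

    def forward(self, x):
        f = self.backbone(x)
        return self.saliency(f), self.contour(f), self.edge(f)

def intertwined_loss(preds, sal_gt, contour_gt, edge_gt):
    sal, con, edg = preds
    loss = (F.binary_cross_entropy_with_logits(sal, sal_gt)
            + F.binary_cross_entropy_with_logits(con, contour_gt)
            + F.binary_cross_entropy_with_logits(edg, edge_gt))
    # Mutual guidance (simplified): contour and edge predictions should agree.
    loss += F.l1_loss(con.sigmoid(), edg.sigmoid())
    return loss

model = TriHeadSketch()
x = torch.randn(2, 3, 64, 64)
gt = torch.randint(0, 2, (2, 1, 64, 64)).float()   # placeholder targets for all three tasks
print(intertwined_loss(model(x), gt, gt, gt).item())
```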
arXiv Detail & Related papers (2025-09-21T22:30:32Z)
- Unmasking Synthetic Realities in Generative AI: A Comprehensive Review of Adversarially Robust Deepfake Detection Systems [4.359154048799454]
The proliferation of deepfake synthetic media poses challenges to digital security, misinformation mitigation, and identity preservation. This systematic review evaluates state-of-the-art deepfake detection methodologies, emphasizing reproducible implementations for transparency and validation. We delineate two core paradigms: (1) detection of fully synthetic media leveraging statistical anomalies and hierarchical feature extraction, and (2) localization of manipulated regions within authentic content employing multi-modal cues such as visual artifacts and temporal inconsistencies.
arXiv Detail & Related papers (2025-07-24T22:05:52Z)
- Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation [40.97921191007003]
We propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance the fine-grained forgery perception ability for DGM4. To be specific, each module first constructs consistency features by leveraging additional supervision from the heterogeneous information of each token pair. Experiments on DGM4 show that CSCL achieves new state-of-the-art performance, especially for grounding manipulated content.
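One way to picture "consistency features over token pairs" is a cross-modal similarity map between image patch tokens and text tokens; the snippet below shows a generic cosine version for illustration, not CSCL's actual construction.

```python
# Generic cross-modal token-pair similarity map; downstream heads could use it
# to spot regions where image and text disagree.
import torch
import torch.nn.functional as F

def token_pair_consistency(img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
    """img_tokens: (B, Ni, D), txt_tokens: (B, Nt, D) -> (B, Ni, Nt) cosine similarities."""
    img_n = F.normalize(img_tokens, dim=-1)
    txt_n = F.normalize(txt_tokens, dim=-1)
    return torch.einsum("bid,btd->bit", img_n, txt_n)

sim = token_pair_consistency(torch.randn(2, 49, 256), torch.randn(2, 32, 256))
print(sim.shape)  # torch.Size([2, 49, 32])
```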
arXiv Detail & Related papers (2025-06-06T08:59:07Z)
- The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts [17.31556625041178]
Multimedia manipulation has emerged as a critical challenge in combating AI-generated disinformation. We propose a new adversarial pipeline that leverages MLLMs to generate high-risk disinformation. We present the Artifact-aware Manipulation Diagnosis via MLLM framework.
arXiv Detail & Related papers (2025-05-23T04:58:27Z)
- ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding [67.66032656360815]
We present ASAP, a new framework for detecting and grounding multi-modal media manipulation (DGM4). We observe that accurate fine-grained cross-modal semantic alignment between image and text is vital for accurate manipulation detection and grounding. We utilize off-the-shelf Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) to construct aligned image-text pairs.
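For readers unfamiliar with cross-modal semantic alignment, a standard symmetric contrastive (InfoNCE) objective is sketched below; ASAP's actual alignment losses and MLLM-based data construction are not reproduced here.

```python
# Generic image-text alignment objective: matched pairs sit on the diagonal of the
# similarity matrix and are pulled together, mismatched pairs pushed apart.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))           # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```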
arXiv Detail & Related papers (2024-12-17T09:33:06Z)
- Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition [56.968108142307976]
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training.
Most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models.
This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects.
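A hedged sketch of word-level consistency regularization on unlabeled text images, assuming a weak/strong augmentation pair with per-character logits (visual aspect) and a pooled word embedding (semantic aspect); the paper's exact formulation may differ.

```python
# Predictions from a weakly augmented view serve as targets for a strongly augmented view.
import torch
import torch.nn.functional as F

def consistency_loss(weak_logits, strong_logits, weak_emb, strong_emb):
    # Visual consistency: per-character distributions should match (KL to weak targets).
    visual = F.kl_div(F.log_softmax(strong_logits, dim=-1),
                      F.softmax(weak_logits.detach(), dim=-1),
                      reduction="batchmean")
    # Semantic consistency: whole-word embeddings of the two views should agree.
    semantic = 1.0 - F.cosine_similarity(strong_emb, weak_emb.detach(), dim=-1).mean()
    return visual + semantic

# Dummy shapes: batch 4, 25 character positions, 97-symbol vocabulary, 512-d word embedding.
loss = consistency_loss(torch.randn(4, 25, 97), torch.randn(4, 25, 97),
                        torch.randn(4, 512), torch.randn(4, 512))
print(loss.item())
```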
arXiv Detail & Related papers (2024-02-24T13:00:54Z)
- Detecting and Grounding Multi-Modal Media Manipulation and Beyond [93.08116982163804]
We highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM4).
DGM4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content.
We propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities.
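The DGM4 setting therefore asks for more than a real/fake verdict: predictions must also ground the manipulated image region and text tokens. A minimal container for such an output could look like this (field names are illustrative, not HAMMER's API).

```python
# Illustrative output structure for detection-and-grounding of multi-modal manipulation.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class DGM4Prediction:
    is_manipulated: bool                                                # binary authenticity decision
    manipulated_bbox: Optional[Tuple[float, float, float, float]] = None  # grounded image region
    manipulated_token_ids: List[int] = field(default_factory=list)        # grounded text tokens

pred = DGM4Prediction(is_manipulated=True,
                      manipulated_bbox=(0.31, 0.12, 0.55, 0.48),
                      manipulated_token_ids=[4, 5, 6])
print(pred)
```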
arXiv Detail & Related papers (2023-09-25T15:05:46Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- MD-CSDNetwork: Multi-Domain Cross Stitched Network for Deepfake Detection [80.83725644958633]
Current deepfake generation methods leave discriminative artifacts in the frequency spectrum of fake images and videos.
We present a novel approach, termed as MD-CSDNetwork, for combining the features in the spatial and frequency domains to mine a shared discriminative representation.
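A cross-stitch-style unit mixes two feature streams with learned weights; the sketch below pairs a spatial stream with crude FFT-magnitude features to convey the idea, without reproducing MD-CSDNetwork's actual layer placement or dimensions.

```python
# Cross-stitch unit: a learned 2x2 matrix mixes spatial- and frequency-domain features.
import torch
import torch.nn as nn

class CrossStitch(nn.Module):
    def __init__(self):
        super().__init__()
        # Initialized near identity so each stream starts mostly independent.
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1], [0.1, 0.9]]))

    def forward(self, spatial_feat, freq_feat):
        mixed_spatial = self.alpha[0, 0] * spatial_feat + self.alpha[0, 1] * freq_feat
        mixed_freq = self.alpha[1, 0] * spatial_feat + self.alpha[1, 1] * freq_feat
        return mixed_spatial, mixed_freq

x = torch.randn(2, 64, 32, 32)       # spatial-stream features
f = torch.fft.fft2(x).abs()          # crude frequency-stream features
s_out, f_out = CrossStitch()(x, f)
print(s_out.shape, f_out.shape)
```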
arXiv Detail & Related papers (2021-09-15T14:11:53Z)
- Text Recognition in Real Scenarios with a Few Labeled Samples [55.07859517380136]
Scene text recognition (STR) is still a hot research topic in the computer vision field.
This paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to build sequence adaptation.
Our approach can maximize the character-level confusion between the source domain and the target domain.
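Character-level domain confusion can be sketched with a generic DANN-style gradient-reversal discriminator, shown below under the assumption of pre-extracted character features; FASDA's actual sequence adaptation mechanism is not reproduced here.

```python
# Generic adversarial domain confusion: a discriminator tries to tell source from
# target character features, while gradient reversal pushes the encoder to confuse it.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainDiscriminator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, char_feats, lambd=1.0):
        # char_feats: (num_chars, dim) character-level features from source + target images.
        reversed_feats = GradReverse.apply(char_feats, lambd)
        return self.net(reversed_feats)   # source-vs-target logits

disc = DomainDiscriminator()
logits = disc(torch.randn(10, 256))
print(logits.shape)  # torch.Size([10, 2])
```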
arXiv Detail & Related papers (2020-06-22T13:03:01Z)