Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation
- URL: http://arxiv.org/abs/2506.05890v1
- Date: Fri, 06 Jun 2025 08:59:07 GMT
- Title: Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation
- Authors: Yiheng Li, Yang Yang, Zichang Tan, Huan Liu, Weihua Chen, Xu Zhou, Zhen Lei
- Abstract summary: We propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance fine-grained forgery perception for DGM4. Specifically, each module first constructs consistency features by leveraging additional supervision from the heterogeneous information of each token pair. Experiments on DGM4 prove that CSCL achieves new state-of-the-art performance, especially for grounding manipulated content.
- Score: 40.97921191007003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To tackle the threat of fake news, the task of detecting and grounding multi-modal media manipulation (DGM4) has received increasing attention. However, most state-of-the-art methods fail to explore the fine-grained consistency within local content, usually resulting in an inadequate perception of detailed forgery and unreliable results. In this paper, we propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance fine-grained forgery perception for DGM4. Two branches are established for the image and text modalities, each containing two cascaded decoders, i.e., a Contextual Consistency Decoder (CCD) and a Semantic Consistency Decoder (SCD), which capture within-modality contextual consistency and across-modality semantic consistency, respectively. Both CCD and SCD adhere to the same criteria for capturing fine-grained forgery details: each module first constructs consistency features by leveraging additional supervision from the heterogeneous information of each token pair, and then applies forgery-aware reasoning or aggregation to mine forgery cues from these consistency features. Extensive experiments on DGM4 datasets show that CSCL achieves new state-of-the-art performance, especially for grounding manipulated content. Code and weights are available at https://github.com/liyih/CSCL.
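The cascaded design described in the abstract lends itself to a sketch. Below is a minimal, hypothetical PyTorch rendering of one CSCL branch; the `ConsistencyDecoder` module, the cosine-similarity consistency map, and the attention-based reasoning step are illustrative assumptions rather than the authors' implementation (the official code lives at the repository linked above).

```python
# Minimal sketch of the cascaded consistency decoders (CCD -> SCD).
# All module and tensor names are illustrative assumptions; see
# https://github.com/liyih/CSCL for the official implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsistencyDecoder(nn.Module):
    """Builds pairwise token-consistency features, then runs a
    forgery-aware attention step over them (hypothetical reading)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.reason = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, queries: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
        # Cosine similarity of every (query, key) token pair stands in
        # for the "consistency features" built from token pairs.
        q = F.normalize(self.proj(queries), dim=-1)
        k = F.normalize(self.proj(keys), dim=-1)
        consistency = q @ k.transpose(1, 2)             # (B, Nq, Nk)
        # Fold the consistency map back into the tokens, then reason.
        fused = queries + consistency @ k               # (B, Nq, D)
        out, _ = self.reason(fused, fused, fused)
        return out

class CSCLBranch(nn.Module):
    """One modality branch: a contextual (within-modality) decoder
    cascaded into a semantic (cross-modality) decoder."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.ccd = ConsistencyDecoder(dim)  # Contextual Consistency Decoder
        self.scd = ConsistencyDecoder(dim)  # Semantic Consistency Decoder

    def forward(self, own_tokens, other_tokens):
        contextual = self.ccd(own_tokens, own_tokens)   # within-modality pairs
        return self.scd(contextual, other_tokens)       # cross-modality pairs

# Usage: the image and text branches exchange tokens at the semantic stage.
img_branch, txt_branch = CSCLBranch(), CSCLBranch()
img_tokens, txt_tokens = torch.randn(2, 49, 256), torch.randn(2, 32, 256)
img_out = img_branch(img_tokens, txt_tokens)  # (2, 49, 256)
txt_out = txt_branch(txt_tokens, img_tokens)  # (2, 32, 256)
```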
Related papers
- Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion [31.189038928192648]
Co2S is a semi-supervised RS segmentation framework that fuses priors from vision-language models and self-supervised models. An explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries. Experiments on six popular datasets demonstrate the superiority of the proposed method.
arXiv Detail & Related papers (2025-12-28T18:24:19Z)
- CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection [8.631593963090985]
Version control relies on commit messages to convey the rationale for code changes, but these messages are often low quality and inconsistent with their diffs, a problem known as message-code inconsistency (MCI). We introduce CODEFUSE-COMMITEVAL, the first benchmark designed for MCI detection using large language models (LLMs). We generate seven types of inconsistent messages through rule-guided mutations of originally consistent commits and apply two-fold validation to verify both positive and negative samples (a toy mutation rule is sketched below).
arXiv Detail & Related papers (2025-11-25T03:33:57Z)
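As a rough illustration of the rule-guided mutation idea above, the toy Python below flips action verbs to manufacture a message inconsistent with its diff; the rule set and function names are assumptions for illustration, not the benchmark's actual seven mutation types.

```python
# Toy illustration of rule-guided mutation for building MCI samples.
# The mutation rules and names here are assumptions; the actual seven
# mutation types are defined in the CODEFUSE-COMMITEVAL paper.
import re

FLIP_ACTION = [("add", "remove"), ("fix", "break"), ("enable", "disable")]

def mutate_message(message: str) -> str:
    """Return a message made inconsistent with its diff by flipping an
    action verb; an unchanged return means no rule applied."""
    for old, new in FLIP_ACTION:
        if re.search(rf"\b{old}\b", message, flags=re.IGNORECASE):
            return re.sub(rf"\b{old}\b", new, message, count=1,
                          flags=re.IGNORECASE)
    return message

# A consistent commit message becomes an inconsistent (MCI) sample:
print(mutate_message("fix null check in parser"))  # -> "break null check in parser"
```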
- Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations [56.816929931908824]
We pioneer the detection of semantically-coordinated manipulations in multimodal data. We propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. Our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches.
arXiv Detail & Related papers (2025-09-16T04:18:48Z)
- METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark [48.78602579128459]
We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
arXiv Detail & Related papers (2025-07-22T03:42:51Z)
- CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation [24.952907733127223]
We propose a general framework for video deepfake detection via Cross-Modal Alignment and Distillation (CAD). CAD comprises two core components: 1) cross-modal alignment that identifies inconsistencies in high-level semantic synchronization (e.g., lip-speech mismatches); 2) cross-modal distillation that mitigates mismatches while preserving modality-specific forensic traces (e.g., spectral distortions in synthetic audio). Both are sketched loosely below.
arXiv Detail & Related papers (2025-05-21T08:11:07Z)
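A minimal sketch of the two CAD components described above, assuming a symmetric InfoNCE loss for alignment and an MSE feature-distillation term; both choices and all names are illustrative, not the paper's exact design.

```python
# Hypothetical sketch of CAD's two losses: alignment scores audio-visual
# semantic synchronization; distillation keeps modality-specific features
# close to a frozen teacher. All names/choices are assumptions.
import torch
import torch.nn.functional as F

def alignment_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE: matched video/audio clips should be nearest
    neighbors; mismatches (e.g., lip-speech desync) score poorly."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature          # (B, B) pairwise similarity
    targets = torch.arange(len(v))            # diagonal = matched pairs
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def distillation_loss(student_feat, teacher_feat):
    """Feature-level distillation preserving modality-specific traces."""
    return F.mse_loss(student_feat, teacher_feat.detach())

# Usage with a toy batch of 8 paired clips:
v, a = torch.randn(8, 512), torch.randn(8, 512)
loss = alignment_loss(v, a) + 0.5 * distillation_loss(v, torch.randn(8, 512))
```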
- Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detections [50.343419243749054]
Anomaly Detection (AD) involves identifying deviations from normal data distributions. We propose a novel approach that conditions the prompts of the text encoder on image context extracted from the vision encoder (see the sketch below). Our method achieves state-of-the-art performance, improving results by 2% to 29% across different metrics on 14 datasets.
arXiv Detail & Related papers (2025-04-15T10:42:25Z)
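The prompt-conditioning idea above can be pictured as follows: learnable prompt tokens offset by a projection of pooled image features. The wiring and names are assumptions about the general mechanism, not Crane's implementation.

```python
# Hypothetical sketch of context-guided prompting: learnable prompt
# tokens are shifted by a projection of pooled vision-encoder features
# before the text encoder consumes them. Names/wiring are assumptions.
import torch
import torch.nn as nn

class ContextGuidedPrompt(nn.Module):
    def __init__(self, n_tokens: int = 8, dim: int = 512, img_dim: int = 768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
        self.to_prompt = nn.Linear(img_dim, dim)  # image context -> prompt space

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, N_patches, img_dim) from the vision encoder.
        context = self.to_prompt(image_feats.mean(dim=1))       # (B, dim)
        return self.prompt.unsqueeze(0) + context.unsqueeze(1)  # (B, n_tokens, dim)

# Each image yields its own conditioned prompt tokens:
prompts = ContextGuidedPrompt()(torch.randn(4, 196, 768))  # (4, 8, 512)
```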
- ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding [67.66032656360815]
We present ASAP, a new framework for detecting and grounding multi-modal media manipulation (DGM4). We observe that accurate fine-grained cross-modal semantic alignment between image and text is vital for accurate manipulation detection and grounding. We utilize off-the-shelf Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) to construct image-text pairs.
arXiv Detail & Related papers (2024-12-17T09:33:06Z)
- Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception [10.614437503578856]
This paper proposes the Meta-Chunking framework, which specifically enhances chunking quality. We design two adaptive chunking techniques based on uncertainty, namely Perplexity Chunking and Margin Sampling Chunking (a toy version of the former is sketched below). We establish a global information compensation mechanism, encompassing a two-stage hierarchical summary generation process and a three-stage text chunk rewriting procedure.
arXiv Detail & Related papers (2024-10-16T17:59:32Z)
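A toy version of perplexity-based chunking in the spirit of the technique named above: a sentence whose perplexity spikes relative to the running context opens a new chunk. The threshold rule, scorer interface, and names are assumptions for illustration, not the paper's exact algorithm.

```python
# Toy perplexity-based chunking. A real scorer would query a language
# model; the threshold rule here is an illustrative assumption.
from typing import Callable, List

def perplexity_chunk(sentences: List[str],
                     ppl: Callable[[str, str], float],
                     threshold: float = 1.5) -> List[List[str]]:
    """ppl(context, sentence) -> LM perplexity of `sentence` given
    `context`. A jump above `threshold` times the running average of
    past scores opens a new chunk."""
    chunks, current, history = [], [], []
    for sent in sentences:
        score = ppl(" ".join(current), sent)
        avg = sum(history) / len(history) if history else score
        if current and score > threshold * avg:
            chunks.append(current)
            current = []
        current.append(sent)
        history.append(score)
    if current:
        chunks.append(current)
    return chunks

# Demo with a stand-in scorer that flags an abrupt topic shift:
fake_ppl = lambda ctx, s: 40.0 if "Meanwhile" in s else 10.0
docs = ["Cats sleep a lot.", "They nap in sunbeams.",
        "Meanwhile, markets fell.", "Traders blamed rates."]
print(perplexity_chunk(docs, fake_ppl))
# [['Cats sleep a lot.', 'They nap in sunbeams.'],
#  ['Meanwhile, markets fell.', 'Traders blamed rates.']]
```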
- Dynamic Weighted Combiner for Mixed-Modal Image Retrieval [8.683144453481328]
Mixed-Modal Image Retrieval (MMIR) as a flexible search paradigm has attracted wide attention.
Previous approaches achieve limited performance due to two critical factors.
We propose a Dynamic Weighted Combiner (DWC) to tackle the above challenges.
arXiv Detail & Related papers (2023-12-11T07:36:45Z)
- Detecting and Grounding Multi-Modal Media Manipulation and Beyond [93.08116982163804]
We highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM4).
DGM4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content.
We propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities.
arXiv Detail & Related papers (2023-09-25T15:05:46Z)
- Inconsistent Matters: A Knowledge-guided Dual-consistency Network for Multi-modal Rumor Detection [53.48346699224921]
A novel Knowledge-guided Dual-consistency Network is proposed to detect rumors with multimedia contents.
It uses two consistency detection networks to capture inconsistency at the cross-modal level and the content-knowledge level simultaneously.
It also enables robust multi-modal representation learning under different missing visual modality conditions.
arXiv Detail & Related papers (2023-06-03T15:32:20Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach that mines cross-modal semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)