ManipShield: A Unified Framework for Image Manipulation Detection, Localization and Explanation
- URL: http://arxiv.org/abs/2511.14259v2
- Date: Tue, 25 Nov 2025 05:37:05 GMT
- Title: ManipShield: A Unified Framework for Image Manipulation Detection, Localization and Explanation
- Authors: Zitong Xu, Huiyu Duan, Xiaoyu Wang, Zhaolin Cai, Kaiwei Zhang, Qiang Hu, Jing Liu, Xiongkuo Min, Guangtao Zhai,
- Abstract summary: We present \textbf{ManipBench}, a large-scale benchmark for image manipulation detection and localization. We also propose \textbf{ManipShield}, an all-in-one model based on a Multimodal Large Language Model (MLLM) to achieve unified image manipulation detection, localization, and explanation.
- Score: 81.52606410224136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement of generative models, powerful image editing methods now enable diverse and highly realistic image manipulations that far surpass traditional deepfake techniques, posing new challenges for manipulation detection. Existing image manipulation detection and localization (IMDL) benchmarks suffer from limited content diversity, narrow generative-model coverage, and insufficient interpretability, which hinders the generalization and explanation capabilities of current manipulation detection methods. To address these limitations, we introduce \textbf{ManipBench}, a large-scale benchmark for image manipulation detection and localization focusing on AI-edited images. ManipBench contains over 450K manipulated images produced by 25 state-of-the-art image editing models across 12 manipulation categories, among which 100K images are further annotated with bounding boxes, judgment cues, and textual explanations to support interpretable detection. Building upon ManipBench, we propose \textbf{ManipShield}, an all-in-one model based on a Multimodal Large Language Model (MLLM) that leverages contrastive LoRA fine-tuning and task-specific decoders to achieve unified image manipulation detection, localization, and explanation. Extensive experiments on ManipBench and several public datasets demonstrate that ManipShield achieves state-of-the-art performance and exhibits strong generality to unseen manipulation models. Both ManipBench and ManipShield will be released upon publication.
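The contrastive LoRA fine-tuning mentioned in the abstract can be illustrated with a minimal sketch. All shapes, names, and the loss below are illustrative assumptions, since ManipShield's implementation is not yet released: a low-rank adapter `B @ A` is added to a frozen base weight, and embeddings of authentic versus manipulated images are trained to be pushed apart.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A.

    This mirrors the standard LoRA parameterization: the pretrained MLLM layer
    stays frozen while only A and B would receive gradients during fine-tuning.
    """

    def __init__(self, d_in, d_out, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.02, (d_in, d_out))   # frozen pretrained weight
        self.A = rng.normal(0, 0.02, (d_in, rank))    # trainable down-projection
        self.B = np.zeros((rank, d_out))              # zero-init: adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x):
        # Base path plus scaled low-rank correction.
        return x @ self.W + (x @ self.A @ self.B) * self.scaling


def contrastive_margin_loss(z_real, z_fake, margin=1.0):
    """Hinge loss pushing authentic and manipulated embeddings apart.

    The abstract does not specify ManipShield's contrastive objective;
    this simple Euclidean-margin form only stands in for the idea.
    """
    d = np.linalg.norm(z_real - z_fake, axis=-1)
    return np.maximum(0.0, margin - d).mean()
```

Because `B` is zero-initialized, the adapted layer reproduces the frozen model exactly at the start of fine-tuning, which is the usual LoRA design choice.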
Related papers
- Weakly-supervised Localization of Manipulated Image Regions Using Multi-resolution Learned Features [4.83420384410068]
Current deep learning-based manipulation detection methods excel in achieving high image-level classification accuracy. The absence of pixel-wise annotations in real-world scenarios limits the existing fully-supervised manipulation localization techniques. We propose a novel weakly-supervised approach that integrates activation maps generated by image-level manipulation detection networks with segmentation maps from pre-trained models.
arXiv Detail & Related papers (2025-05-29T15:58:29Z) - CapsFake: A Multimodal Capsule Network for Detecting Instruction-Guided Deepfakes [3.2194551406014886]
Deepfake technology threatens the integrity of digital images by enabling subtle, context-aware manipulations. We propose CapsFake, designed to detect such deepfake image edits by integrating low-level capsules from visual, textual, and frequency-domain modalities. High-level capsules, predicted through a competitive routing mechanism, dynamically aggregate local features to identify manipulated regions with precision.
arXiv Detail & Related papers (2025-04-27T12:31:47Z) - DefMamba: Deformable Visual State Space Model [65.50381013020248]
We propose a novel visual foundation model called DefMamba. By incorporating a deformable scanning (DS) strategy, the model significantly improves its ability to learn image structures and detect changes in object details. Numerous experiments show that DefMamba achieves state-of-the-art performance on various visual tasks.
arXiv Detail & Related papers (2025-04-08T08:22:54Z) - Context-Aware Weakly Supervised Image Manipulation Localization with SAM Refinement [52.15627062770557]
Malicious image manipulation poses societal risks, increasing the importance of effective image manipulation detection methods. Recent progress in image manipulation detection has largely been driven by fully supervised methods. We present a novel weakly supervised framework based on a dual-branch Transformer-CNN architecture.
arXiv Detail & Related papers (2025-03-26T07:35:09Z) - EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM [50.054404519821745]
We present a novel framework that integrates a multimodal Large Language Model for enhanced reasoning capabilities. Our framework achieves promising results on MagicBrush, AutoSplice, and PerfBrush datasets. Notably, our method excels on the PerfBrush dataset, a self-constructed test set featuring previously unseen types of edits.
arXiv Detail & Related papers (2024-12-05T02:05:33Z) - FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models [16.737419222106308]
FakeShield is a framework capable of evaluating image authenticity, generating tampered region masks, and providing a judgment basis based on pixel-level and image-level tampering clues. In experiments, FakeShield effectively detects and localizes various tampering techniques, offering an explainable and superior solution compared to previous IFDL methods.
arXiv Detail & Related papers (2024-10-03T17:59:34Z) - Diffusion Model-Based Image Editing: A Survey [46.244266782108234]
Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks. We provide an exhaustive overview of existing methods using diffusion models for image editing. To further evaluate the performance of text-guided image editing algorithms, we propose a systematic benchmark, EditEval.
arXiv Detail & Related papers (2024-02-27T14:07:09Z) - MSMG-Net: Multi-scale Multi-grained Supervised Metworks for Multi-task
Image Manipulation Detection and Localization [1.14219428942199]
A novel multi-scale multi-grained deep network (MSMG-Net) is proposed to automatically identify manipulated regions.
In MSMG-Net, a parallel feature extraction structure captures features at multiple scales.
MSMG-Net can effectively perceive object-level semantics and encode edge artifacts.
arXiv Detail & Related papers (2022-11-06T14:58:21Z) - ObjectFormer for Image Manipulation Detection and Localization [118.89882740099137]
We propose ObjectFormer to detect and localize image manipulations.
We extract high-frequency features of the images and combine them with RGB features as multimodal patch embeddings.
We conduct extensive experiments on various datasets and the results verify the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-03-28T12:27:34Z) - Swapping Autoencoder for Deep Image Manipulation [94.33114146172606]
We propose the Swapping Autoencoder, a deep model designed specifically for image manipulation.
The key idea is to encode an image with two independent components and enforce that any swapped combination maps to a realistic image.
Experiments on multiple datasets show that our model produces better results and is substantially more efficient compared to recent generative models.
arXiv Detail & Related papers (2020-07-01T17:59:57Z)
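The swapping idea in the last entry above can be sketched in a few lines. The encoder, decoder, and code shapes below are toy assumptions, not the paper's conv-net architecture: an image code is split into a structure factor and a texture factor, and any cross-pairing of factors from two images must decode to a plausible image.

```python
import numpy as np

def encode(img, d_struct=4):
    """Toy encoder: first half of the flattened code is 'structure', rest is 'texture'.

    The real Swapping Autoencoder learns these two factors with convolutional
    networks and a co-occurrence discriminator; this split only shows the interface.
    """
    code = img.reshape(-1)
    return code[:d_struct], code[d_struct:]

def decode(structure, texture):
    """Toy decoder: recombine the two factors into a flat 'image'."""
    return np.concatenate([structure, texture])

# Swap: keep the structure of image 1, borrow the texture of image 2.
img1 = np.arange(8, dtype=float)
img2 = np.arange(8, dtype=float) + 100
s1, t1 = encode(img1)
s2, t2 = encode(img2)
hybrid = decode(s1, t2)
```

The training constraint in the paper is that `hybrid` (and every other swapped combination) must fool a discriminator, which is what forces the two code components to disentangle.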
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.