CapsFake: A Multimodal Capsule Network for Detecting Instruction-Guided Deepfakes
- URL: http://arxiv.org/abs/2504.19212v1
- Date: Sun, 27 Apr 2025 12:31:47 GMT
- Title: CapsFake: A Multimodal Capsule Network for Detecting Instruction-Guided Deepfakes
- Authors: Tuan Nguyen, Naseem Khan, Issa Khalil
- Abstract summary: Deepfake technology threatens the integrity of digital images by enabling subtle, context-aware manipulations. We propose CapsFake, designed to detect such deepfake image edits by integrating low-level capsules from visual, textual, and frequency-domain modalities. High-level capsules, predicted through a competitive routing mechanism, dynamically aggregate local features to identify manipulated regions with precision.
- Score: 3.2194551406014886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid evolution of deepfake technology, particularly in instruction-guided image editing, threatens the integrity of digital images by enabling subtle, context-aware manipulations. Generated conditionally from real images and textual prompts, these edits are often imperceptible to both humans and existing detection systems, revealing significant limitations in current defenses. We propose a novel multimodal capsule network, CapsFake, designed to detect such deepfake image edits by integrating low-level capsules from visual, textual, and frequency-domain modalities. High-level capsules, predicted through a competitive routing mechanism, dynamically aggregate local features to identify manipulated regions with precision. Evaluated on diverse datasets, including MagicBrush, Unsplash Edits, Open Images Edits, and Multi-turn Edits, CapsFake outperforms state-of-the-art methods by up to 20% in detection accuracy. Ablation studies validate its robustness, achieving detection rates above 94% under natural perturbations and 96% against adversarial attacks, with excellent generalization to unseen editing scenarios. This approach establishes a powerful framework for countering sophisticated image manipulations.
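The core mechanism described in the abstract, low-level capsules per modality aggregated into high-level capsules by a competitive routing step, can be illustrated with a short PyTorch sketch. Everything below is an assumption for illustration: the module names, capsule dimensions, and the routing-by-agreement variant are placeholders, not the authors' released implementation.

```python
# Minimal sketch of a multimodal capsule detector in the spirit of CapsFake:
# low-level capsules from visual, textual, and frequency-domain features are
# aggregated into high-level capsules by iterative routing-by-agreement.
# All dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def squash(s, dim=-1, eps=1e-8):
    """Capsule squashing non-linearity (Sabour et al., 2017)."""
    norm2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + eps)


class LowLevelCapsules(nn.Module):
    """Projects one modality's feature vector into a set of low-level capsules."""
    def __init__(self, in_dim, num_caps=8, caps_dim=16):
        super().__init__()
        self.num_caps, self.caps_dim = num_caps, caps_dim
        self.proj = nn.Linear(in_dim, num_caps * caps_dim)

    def forward(self, x):                                   # x: (B, in_dim)
        u = self.proj(x).view(-1, self.num_caps, self.caps_dim)
        return squash(u)                                    # (B, num_caps, caps_dim)


class RoutingCapsules(nn.Module):
    """High-level capsules computed by iterative routing-by-agreement."""
    def __init__(self, in_caps, in_dim, out_caps=2, out_dim=32, iters=3):
        super().__init__()
        self.iters = iters
        self.W = nn.Parameter(0.01 * torch.randn(out_caps, in_caps, out_dim, in_dim))

    def forward(self, u):                                   # u: (B, in_caps, in_dim)
        # Each low-level capsule i casts a vote for each high-level capsule o.
        u_hat = torch.einsum('oidk,bik->boid', self.W, u)   # (B, O, I, D_out)
        b = torch.zeros(u_hat.shape[:3], device=u.device)   # routing logits (B, O, I)
        for _ in range(self.iters):
            c = F.softmax(b, dim=1)                         # competition across high-level capsules
            s = (c.unsqueeze(-1) * u_hat).sum(dim=2)        # weighted aggregation of votes
            v = squash(s)                                   # (B, O, D_out)
            b = b + (u_hat * v.unsqueeze(2)).sum(dim=-1)    # agreement update
        return v


class MultimodalCapsuleDetector(nn.Module):
    """Fuses visual, textual, and frequency capsules into 'real' vs 'edited' capsules."""
    def __init__(self, vis_dim=512, txt_dim=512, freq_dim=256):
        super().__init__()
        self.vis_caps = LowLevelCapsules(vis_dim)
        self.txt_caps = LowLevelCapsules(txt_dim)
        self.freq_caps = LowLevelCapsules(freq_dim)
        self.routing = RoutingCapsules(in_caps=3 * 8, in_dim=16)

    def forward(self, vis_feat, txt_feat, freq_feat):
        u = torch.cat([self.vis_caps(vis_feat),
                       self.txt_caps(txt_feat),
                       self.freq_caps(freq_feat)], dim=1)   # (B, 24, 16)
        v = self.routing(u)                                  # (B, 2, 32)
        return v.norm(dim=-1)                                # capsule lengths as class scores


if __name__ == "__main__":
    model = MultimodalCapsuleDetector()
    scores = model(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 256))
    print(scores.shape)  # torch.Size([4, 2])
```

In the paper's setting, the low-level capsule inputs would come from learned visual, textual (edit prompt), and frequency-domain encoders rather than random vectors, and the lengths of the "real" versus "edited" high-level capsules would drive the detection decision.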
Related papers
- Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detections [50.343419243749054]
Anomaly Detection (AD) involves identifying deviations from normal data distributions.
We propose a novel approach that conditions the prompts of the text encoder based on image context extracted from the vision encoder.
Our method achieves state-of-the-art performance, improving performance by 2% to 29% across different metrics on 14 datasets.
arXiv Detail & Related papers (2025-04-15T10:42:25Z) - Conditioned Prompt-Optimization for Continual Deepfake Detection [11.634681724245933]
This paper introduces Prompt2Guard, a novel solution for photorealistic-free continual deepfake detection of images.
We leverage a prediction ensembling technique with read-only prompts, mitigating the need for multiple forward passes.
Our method exploits a text-prompt conditioning tailored to deepfake detection, which we demonstrate is beneficial in our setting.
arXiv Detail & Related papers (2024-07-31T12:22:57Z) - Robust CLIP-Based Detector for Exposing Diffusion Model-Generated Images [13.089550724738436]
Diffusion models (DMs) have revolutionized image generation, producing high-quality images with applications spanning various fields.
Their ability to create hyper-realistic images poses significant challenges in distinguishing between real and synthetic content.
This work introduces a robust detection framework that integrates image and text features extracted by a CLIP model with a Multilayer Perceptron (MLP) classifier; a minimal sketch of this recipe appears after the list below.
arXiv Detail & Related papers (2024-04-19T14:30:41Z) - MMNet: Multi-Collaboration and Multi-Supervision Network for Sequential Deepfake Detection [81.59191603867586]
Sequential deepfake detection aims to identify forged facial regions with the correct sequence for recovery.
The recovery of forged images requires knowledge of the manipulation model to implement inverse transformations.
We propose Multi-Collaboration and Multi-Supervision Network (MMNet) that handles various spatial scales and sequential permutations in forged face images.
arXiv Detail & Related papers (2023-07-06T02:32:08Z) - Building an Invisible Shield for Your Portrait against Deepfakes [34.65356811439098]
We propose a novel framework - Integrity Encryptor, aiming to protect portraits in a proactive strategy.
Our methodology involves covertly encoding messages that are closely associated with key facial attributes into authentic images.
The modified facial attributes serve as a means of detecting manipulated images through a comparison of the decoded messages.
arXiv Detail & Related papers (2023-05-22T10:01:28Z) - Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study on deepfake detection generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z) - Deep Convolutional Pooling Transformer for Deepfake Detection [54.10864860009834]
We propose a deep convolutional Transformer to incorporate decisive image features both locally and globally.
Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy.
The proposed solution consistently outperforms several state-of-the-art baselines on both within- and cross-dataset experiments.
arXiv Detail & Related papers (2022-09-12T15:05:41Z) - ObjectFormer for Image Manipulation Detection and Localization [118.89882740099137]
We propose ObjectFormer to detect and localize image manipulations.
We extract high-frequency features of the images and combine them with RGB features as multimodal patch embeddings.
We conduct extensive experiments on various datasets and the results verify the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-03-28T12:27:34Z) - Detect and Locate: A Face Anti-Manipulation Approach with Semantic and Noise-level Supervision [67.73180660609844]
We propose a conceptually simple but effective method to efficiently detect forged faces in an image.
The proposed scheme relies on a segmentation map that delivers meaningful high-level semantic information clues about the image.
The proposed model achieves state-of-the-art detection accuracy and remarkable localization performance.
arXiv Detail & Related papers (2021-07-13T02:59:31Z) - M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [74.19291916812921]
Forged images generated by Deepfake techniques pose a serious threat to the trustworthiness of digital information.
In this paper, we aim to capture the subtle manipulation artifacts at different scales for Deepfake detection.
We introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 DeepFake videos generated by state-of-the-art face swapping and facial reenactment methods.
arXiv Detail & Related papers (2021-04-20T05:43:44Z) - Image Manipulation Detection by Multi-View Multi-Scale Supervision [11.319080833880307]
A key challenge of image manipulation detection is learning generalizable features that are sensitive to manipulations in novel data.
In this paper we address both aspects by multi-view feature learning and multi-scale supervision.
Our thoughts are realized by a new network which we term MVSS-Net.
arXiv Detail & Related papers (2021-04-14T13:05:58Z)
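As a companion to the Robust CLIP-Based Detector entry above, the following is a minimal sketch of the recipe it describes: frozen CLIP image and text features concatenated and fed to a small MLP classifier. The checkpoint name, feature dimensions, and MLP layout are assumptions for illustration, not the configuration used in that paper.

```python
# Illustrative sketch of a CLIP-features + MLP deepfake detector.
# Checkpoint name, feature sizes, and MLP layout are assumptions.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip.eval()


class CLIPMLPDetector(nn.Module):
    """Binary classifier over concatenated (image, text) CLIP embeddings."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),            # logits for {real, generated}
        )

    def forward(self, img_feat, txt_feat):
        return self.mlp(torch.cat([img_feat, txt_feat], dim=-1))


@torch.no_grad()
def extract_features(image: Image.Image, caption: str):
    """Frozen CLIP features for one image/caption pair."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    img_feat = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feat = clip.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    return img_feat, txt_feat


detector = CLIPMLPDetector()
# Placeholder input; in practice this would be a suspected edit and its prompt.
img = Image.new("RGB", (224, 224))
logits = detector(*extract_features(img, "a photo edited to add a red car"))
print(logits.shape)  # torch.Size([1, 2])
```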