Related papers: Multi-modal Deepfake Detection and Localization with FPN-Transformer

Multi-modal Deepfake Detection and Localization with FPN-Transformer

URL: http://arxiv.org/abs/2511.08031v1
Date: Wed, 12 Nov 2025 01:35:16 GMT
Title: Multi-modal Deepfake Detection and Localization with FPN-Transformer
Authors: Chende Zheng, Ruiqi Suo, Zhoulin Ji, Jingyi Deng, Fangbin Yi, Chenhao Lin, Chao Shen,
Abstract summary: We introduce a multi-modal deepfake detection and localization framework based on a Feature Pyramid-Transformer (FPN-Transformer)<n>A multi-scale feature pyramid is constructed through R-TLM blocks with localized attention mechanisms, enabling joint analysis of cross-context temporal dependencies.<n>We evaluate our approach on the test set of the IJCAI'25 DDL-AV benchmark, showing a good performance with a final score of 0.7535.
Score: 21.022230340898556
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid advancement of generative adversarial networks (GANs) and diffusion models has enabled the creation of highly realistic deepfake content, posing significant threats to digital trust across audio-visual domains. While unimodal detection methods have shown progress in identifying synthetic media, their inability to leverage cross-modal correlations and precisely localize forged segments limits their practicality against sophisticated, fine-grained manipulations. To address this, we introduce a multi-modal deepfake detection and localization framework based on a Feature Pyramid-Transformer (FPN-Transformer), addressing critical gaps in cross-modal generalization and temporal boundary regression. The proposed approach utilizes pre-trained self-supervised models (WavLM for audio, CLIP for video) to extract hierarchical temporal features. A multi-scale feature pyramid is constructed through R-TLM blocks with localized attention mechanisms, enabling joint analysis of cross-context temporal dependencies. The dual-branch prediction head simultaneously predicts forgery probabilities and refines temporal offsets of manipulated segments, achieving frame-level localization precision. We evaluate our approach on the test set of the IJCAI'25 DDL-AV benchmark, showing a good performance with a final score of 0.7535 for cross-modal deepfake detection and localization in challenging environments. Experimental results confirm the effectiveness of our approach and provide a novel way for generalized deepfake detection. Our code is available at https://github.com/Zig-HS/MM-DDL

Related papers

Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization [37.361231344742045]
We propose a single-stage training framework that enhances generalization.<n>We introduce a window-level attention mechanism to capture discrepancies between predicted and actual frames.<n>Our model, evaluated on multiple benchmark datasets, demonstrates strong generalization and precise temporal localization.
arXiv Detail & Related papers (2025-11-13T11:34:03Z)
CAST: Cross-Attentive Spatio-Temporal feature fusion for Deepfake detection [0.0]
CNNs are effective at capturing spatial artifacts, and Transformers excel at modeling temporal inconsistencies.<n>We propose a unified CAST model that leverages cross-attention to effectively fuse spatial and temporal features.<n>We evaluate the performance of our model using the FaceForensics++, Celeb-DF, and DeepfakeDetection datasets.
arXiv Detail & Related papers (2025-06-26T18:51:17Z)
DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization [13.840950434728533]
DiMoDif is an audio-visual deepfake detection framework.<n>It exploits the inter-modality differences in machine perception of speech.<n>It temporally localizes the deepfake forgery.
arXiv Detail & Related papers (2024-11-15T13:47:33Z)
Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization [52.87635234206178]
This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization. The framework incorporates two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM)
arXiv Detail & Related papers (2024-08-05T08:35:59Z)
Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization [3.9440964696313485]
In the digital age, the emergence of deepfakes and synthetic media presents a significant threat to societal and political integrity. Deepfakes based on multi-modal manipulation, such as audio-visual, are more realistic and pose a greater threat. We propose a novel multi-modal attention framework based on recurrent neural networks (RNNs) that leverages contextual information for audio-visual deepfake detection.
arXiv Detail & Related papers (2024-08-02T18:45:01Z)
Explicit Correlation Learning for Generalizable Cross-Modal Deepfake Detection [33.20064862916194]
This paper aims to learn potential cross-modal correlation to enhance deepfake detection towards various generation scenarios. Our approach introduces a correlation distillation task, which models the inherent cross-modal correlation based on content information. We present the Cross-Modal Deepfake dataset with four generation methods to evaluate the detection of diverse cross-modal deepfakes.
arXiv Detail & Related papers (2024-04-30T00:25:44Z)
Deep Convolutional Pooling Transformer for Deepfake Detection [54.10864860009834]
We propose a deep convolutional Transformer to incorporate decisive image features both locally and globally. Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy. The proposed solution consistently outperforms several state-of-the-art baselines on both within- and cross-dataset experiments.
arXiv Detail & Related papers (2022-09-12T15:05:41Z)
Delving into Sequential Patches for Deepfake Detection [64.19468088546743]
Recent advances in face forgery techniques produce nearly untraceable deepfake videos, which could be leveraged with malicious intentions. Previous studies has identified the importance of local low-level cues and temporal information in pursuit to generalize well across deepfake methods. We propose the Local- & Temporal-aware Transformer-based Deepfake Detection framework, which adopts a local-to-global learning protocol.
arXiv Detail & Related papers (2022-07-06T16:46:30Z)
Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion [62.269219152425556]
segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field. We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network. An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
arXiv Detail & Related papers (2022-02-21T15:30:14Z)
MD-CSDNetwork: Multi-Domain Cross Stitched Network for Deepfake Detection [80.83725644958633]
Current deepfake generation methods leave discriminative artifacts in the frequency spectrum of fake images and videos. We present a novel approach, termed as MD-CSDNetwork, for combining the features in the spatial and frequency domains to mine a shared discriminative representation.
arXiv Detail & Related papers (2021-09-15T14:11:53Z)
Depthwise Non-local Module for Fast Salient Object Detection Using a Single Thread [136.2224792151324]
We propose a new deep learning algorithm for fast salient object detection. The proposed algorithm achieves competitive accuracy and high inference efficiency simultaneously with a single CPU thread.
arXiv Detail & Related papers (2020-01-22T15:23:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.