Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection
- URL: http://arxiv.org/abs/2501.01184v3
- Date: Sat, 19 Jul 2025 09:15:28 GMT
- Title: Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection
- Authors: Dat Nguyen, Marcella Astrid, Anis Kacem, Enjie Ghorbel, Djamila Aouada
- Abstract summary: We propose a fine-grained deepfake video detection approach called FakeSTormer. Specifically, we introduce a multi-task learning framework that incorporates two auxiliary branches for explicitly attending artifact-prone spatial and temporal regions. We also propose a video-level synthesis strategy that generates pseudo-fake videos with subtle spatio-temporal artifacts.
- Score: 14.586314545834934
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Detecting deepfake videos is highly challenging given the complexity of characterizing spatio-temporal artifacts. Most existing methods rely on binary classifiers trained using real and fake image sequences, therefore hindering their generalization capabilities to unseen generation methods. Moreover, with the constant progress in generative Artificial Intelligence (AI), deepfake artifacts are becoming imperceptible at both the spatial and the temporal levels, making them extremely difficult to capture. To address these issues, we propose a fine-grained deepfake video detection approach called FakeSTormer that enforces the modeling of subtle spatio-temporal inconsistencies while avoiding overfitting. Specifically, we introduce a multi-task learning framework that incorporates two auxiliary branches for explicitly attending artifact-prone spatial and temporal regions. Additionally, we propose a video-level data synthesis strategy that generates pseudo-fake videos with subtle spatio-temporal artifacts, providing high-quality samples and hand-free annotations for our additional branches. Extensive experiments on several challenging benchmarks demonstrate the superiority of our approach compared to recent state-of-the-art methods. The code is available at https://github.com/10Ring/FakeSTormer.
Related papers
- BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos [63.03271511550633]
BrokenVideos is a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Our experiments show that training state-of-the-art artifact detection models and multimodal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions.
arXiv Detail & Related papers (2025-06-25T03:30:04Z)
- Deepfake Detection with Spatio-Temporal Consistency and Attention [46.1135899490656]
Deepfake videos are causing growing concerns among communities due to their ever-increasing realism. Current methods for detecting forged videos rely mainly on global frame features. We propose a neural Deepfake detector that focuses on the localized manipulative signatures of the forged videos.
arXiv Detail & Related papers (2025-02-12T08:51:33Z)
- Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning [41.30923253467854]
Temporal features can be complex and diverse. Spatiotemporal models often lean heavily on one type of artifact and ignore the other. Videos are naturally resource-intensive.
arXiv Detail & Related papers (2024-08-30T07:49:57Z)
- UniForensics: Face Forgery Detection via General Facial Representation [60.5421627990707]
High-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization.
We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z)
- Learning Temporally Consistent Video Depth from Video Diffusion Priors [62.36887303063542]
This work addresses the challenge of streamed video depth estimation. We argue that sharing contextual information between frames or clips is pivotal in fostering temporal consistency. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context.
arXiv Detail & Related papers (2024-06-03T16:20:24Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- AltFreezing for More General Video Face Forgery Detection [138.5732617371004]
We propose to capture both spatial and unseen temporal artifacts in one model for face forgery detection.
We present a novel training strategy called AltFreezing for more general face forgery detection.
arXiv Detail & Related papers (2023-07-17T08:24:58Z)
- Undercover Deepfakes: Detecting Fake Segments in Videos [1.2609216345578933]
A new paradigm of deepfakes consists of mostly real videos that are altered slightly to distort the truth.
In this paper, we present a deepfake detection method that can address this issue by performing deepfake prediction at the frame and video levels.
In particular, the paradigm we address will form a powerful tool for the moderation of deepfakes, where human oversight can be better targeted to the parts of videos suspected of being deepfakes.
arXiv Detail & Related papers (2023-05-11T04:43:10Z)
- Detecting Deepfake by Creating Spatio-Temporal Regularity Disruption [94.5031244215761]
We propose to boost the generalization of deepfake detection by distinguishing the "regularity disruption" that does not appear in real videos.
Specifically, by carefully examining the spatial and temporal properties, we propose to disrupt a real video through a Pseudo-fake Generator.
Such practice allows us to achieve deepfake detection without using fake videos and improves the generalization ability in a simple and efficient manner.
arXiv Detail & Related papers (2022-07-21T10:42:34Z)
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
- Voice-Face Homogeneity Tells Deepfake [56.334968246631725]
Existing detection approaches contribute to exploring the specific artifacts in deepfake videos.
We propose to perform the deepfake detection from an unexplored voice-face matching view.
Our model obtains significantly improved performance as compared to other state-of-the-art competitors.
arXiv Detail & Related papers (2022-03-04T09:08:50Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Detection of Deepfake Videos Using Long Distance Attention [73.6659488380372]
Most existing detection methods treat the problem as a vanilla binary classification problem.
In this paper, the problem is treated as a special fine-grained classification problem since the differences between fake and real faces are very subtle.
A spatial-temporal model is proposed which has two components for capturing spatial and temporal forgery traces in global perspective.
arXiv Detail & Related papers (2021-06-24T08:33:32Z)
- Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis [69.09526348527203]
Deep generative models have led to highly realistic media, known as deepfakes, that are commonly indistinguishable from real to human eyes.
We propose a novel fake detection that is designed to re-synthesize testing images and extract visual cues for detection.
We demonstrate the improved effectiveness, cross-GAN generalization, and robustness against perturbations of our approach in a variety of detection scenarios.
arXiv Detail & Related papers (2021-05-29T21:22:24Z)
- Spatio-temporal Features for Generalized Detection of Deepfake Videos [12.453288832098314]
We propose spatio-temporal features, modeled by 3D CNNs, to extend the capabilities to detect new sorts of deepfake videos.
We show that our approach outperforms existing methods in terms of generalization capabilities.
arXiv Detail & Related papers (2020-10-22T16:28:50Z)
- Deepfake Detection using Spatiotemporal Convolutional Networks [0.0]
Deepfake detection methods only use individual frames and therefore fail to learn from temporal information.
We created a benchmark of performance using the Celeb-DF dataset.
Our methods outperformed state-of-the-art frame-based detection methods.
arXiv Detail & Related papers (2020-06-26T01:32:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.