Vulnerability-Aware Spatio-Temporal Learning for Generalizable and Interpretable Deepfake Video Detection
- URL: http://arxiv.org/abs/2501.01184v2
- Date: Thu, 16 Jan 2025 17:11:06 GMT
- Title: Vulnerability-Aware Spatio-Temporal Learning for Generalizable and Interpretable Deepfake Video Detection
- Authors: Dat Nguyen, Marcella Astrid, Anis Kacem, Enjie Ghorbel, Djamila Aouada
- Abstract summary: Deepfake videos are highly challenging to detect due to the complex intertwined temporal and spatial artifacts in forged sequences. Most recent approaches rely on binary classifiers trained on both real and fake data. First, we introduce a multi-task learning framework with additional spatial and temporal branches that enable the model to focus on subtle artifacts. Second, we propose a video-level data synthesis algorithm that generates pseudo-fake videos with subtle artifacts, providing the model with high-quality samples and ground truth data.
- Score: 14.586314545834934
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Detecting deepfake videos is highly challenging due to the complex intertwined spatial and temporal artifacts in forged sequences. Most recent approaches rely on binary classifiers trained on both real and fake data. However, such methods may struggle to focus on important artifacts, which can hinder their generalization capability. Additionally, these models often lack interpretability, making it difficult to understand how predictions are made. To address these issues, we propose FakeSTormer, offering two key contributions. First, we introduce a multi-task learning framework with additional spatial and temporal branches that enable the model to focus on subtle spatio-temporal artifacts. These branches also provide interpretability by highlighting video regions that may contain artifacts. Second, we propose a video-level data synthesis algorithm that generates pseudo-fake videos with subtle artifacts, providing the model with high-quality samples and ground truth data for our spatial and temporal branches. Extensive experiments on several challenging benchmarks demonstrate the competitiveness of our approach compared to recent state-of-the-art methods. The code is available at https://github.com/10Ring/FakeSTormer.
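As a rough illustration of the two-branch multi-task design described in the abstract (a minimal sketch with assumed module names, shapes, and loss weights, not the authors' released implementation; see the repository above for the real code), a shared video backbone can feed a real/fake classification head alongside auxiliary spatial and temporal artifact heads, each supervised by the ground truth produced by the pseudo-fake synthesis:

```python
# Minimal sketch of a multi-task deepfake detector with auxiliary
# spatial and temporal branches, in the spirit of FakeSTormer.
# Module names, shapes, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskDetector(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Stand-in for a spatio-temporal video backbone (e.g., a transformer).
        self.backbone = nn.Conv3d(3, feat_dim, kernel_size=3, padding=1)
        self.cls_head = nn.Linear(feat_dim, 1)          # real/fake logit
        self.spatial_head = nn.Conv3d(feat_dim, 1, 1)   # per-pixel artifact map
        self.temporal_head = nn.Linear(feat_dim, 1)     # per-frame artifact score

    def forward(self, video):                            # video: (B, 3, T, H, W)
        feats = self.backbone(video)                     # (B, C, T, H, W)
        pooled = feats.mean(dim=(2, 3, 4))               # (B, C)
        frame_feats = feats.mean(dim=(3, 4)).transpose(1, 2)  # (B, T, C)
        return {
            "logit": self.cls_head(pooled).squeeze(-1),               # (B,)
            "spatial": self.spatial_head(feats).squeeze(1),           # (B, T, H, W)
            "temporal": self.temporal_head(frame_feats).squeeze(-1),  # (B, T)
        }

def multitask_loss(out, y_cls, y_spatial, y_temporal, w_s=0.5, w_t=0.5):
    bce = nn.functional.binary_cross_entropy_with_logits
    return (bce(out["logit"], y_cls)
            + w_s * bce(out["spatial"], y_spatial)
            + w_t * bce(out["temporal"], y_temporal))
```

The auxiliary heads double as the interpretability outputs mentioned in the abstract: thresholding their maps highlights where and when the model believes artifacts occur.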
Related papers
- BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos [63.03271511550633]
BrokenVideos is a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Our experiments show that training state-of-the-art artifact detection models and multimodal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions.
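A localization benchmark of this kind is typically scored by overlap with the annotated masks; the following is an assumed sketch (the threshold and mask format are illustrative, not the benchmark's official protocol):

```python
# Hypothetical per-frame IoU between a predicted artifact heatmap and a
# binary ground-truth corruption mask; thr is an assumed threshold.
import numpy as np

def mask_iou(pred_map: np.ndarray, gt_mask: np.ndarray, thr: float = 0.5) -> float:
    pred = pred_map >= thr
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum()) / float(union)

# Example: average IoU over an 8-frame clip.
pred_clip = np.random.rand(8, 224, 224)
gt_clip = np.random.rand(8, 224, 224) > 0.9
print(np.mean([mask_iou(p, g) for p, g in zip(pred_clip, gt_clip)]))
```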
arXiv Detail & Related papers (2025-06-25T03:30:04Z)
- Deepfake Detection with Spatio-Temporal Consistency and Attention [46.1135899490656]
Deepfake videos are causing growing concerns among communities due to their ever-increasing realism. Current methods for detecting forged videos rely mainly on global frame features. We propose a neural Deepfake detector that focuses on the localized manipulative signatures of the forged videos.
arXiv Detail & Related papers (2025-02-12T08:51:33Z)
- Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning [41.30923253467854]
Temporal features can be complex and diverse. Spatiotemporal models often lean heavily on one type of artifact and ignore the other. Videos are naturally resource-intensive.
arXiv Detail & Related papers (2024-08-30T07:49:57Z)
- UniForensics: Face Forgery Detection via General Facial Representation [60.5421627990707]
High-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization.
We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z)
- Learning Temporally Consistent Video Depth from Video Diffusion Priors [62.36887303063542]
This work addresses the challenge of streamed video depth estimation. We argue that sharing contextual information between frames or clips is pivotal in fostering temporal consistency. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context.
arXiv Detail & Related papers (2024-06-03T16:20:24Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicated components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- AltFreezing for More General Video Face Forgery Detection [138.5732617371004]
We propose to capture both spatial and unseen temporal artifacts in one model for face forgery detection.
We present a novel training strategy called AltFreezing for more general face forgery detection.
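The abstract does not spell out the schedule, but the idea of AltFreezing is to alternately freeze the spatial and temporal weights so that neither artifact type dominates training. A minimal sketch, assuming a factorized backbone whose parameter names distinguish the two groups:

```python
# Alternating freeze schedule: even periods train spatial weights only,
# odd periods train temporal weights only. The name-based grouping is
# an assumption for illustration.
import torch.nn as nn

def set_alt_freeze(model: nn.Module, step: int, period: int = 20) -> None:
    freeze_temporal = (step // period) % 2 == 0
    for name, param in model.named_parameters():
        if "temporal" in name:
            param.requires_grad = not freeze_temporal
        elif "spatial" in name:
            param.requires_grad = freeze_temporal

# Call set_alt_freeze(model, step) before each optimizer update;
# parameters with requires_grad=False receive no gradient that step.
```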
arXiv Detail & Related papers (2023-07-17T08:24:58Z)
- Undercover Deepfakes: Detecting Fake Segments in Videos [1.2609216345578933]
A new paradigm of deepfake generation has emerged: mostly real videos altered only slightly to distort the truth.
In this paper, we present a deepfake detection method that can address this issue by performing deepfake prediction at the frame and video levels.
In particular, the paradigm we address will form a powerful tool for the moderation of deepfakes, where human oversight can be better targeted to the parts of videos suspected of being deepfakes.
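For moderation use, frame-level predictions must be turned into suspected segments plus a video-level score. A hedged sketch of one simple aggregation (the smoothing window and threshold are assumptions, not the paper's settings):

```python
# Smooth per-frame fake probabilities, threshold them into suspect
# segments, and report the peak as a video-level score.
import numpy as np

def flag_segments(frame_probs, thr=0.6, win=3):
    kernel = np.ones(win) / win
    smoothed = np.convolve(frame_probs, kernel, mode="same")
    suspect = smoothed > thr
    segments, start = [], None
    for i, s in enumerate(suspect):
        if s and start is None:
            start = i
        elif not s and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(suspect) - 1))
    return segments, float(smoothed.max())

# A short fake burst in frames 2-4 is flagged as one segment.
print(flag_segments(np.array([0.1, 0.2, 0.9, 0.95, 0.9, 0.2, 0.1])))
```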
arXiv Detail & Related papers (2023-05-11T04:43:10Z)
- Detecting Deepfake by Creating Spatio-Temporal Regularity Disruption [94.5031244215761]
We propose to boost the generalization of deepfake detection by distinguishing the "regularity disruption" that does not appear in real videos.
Specifically, by carefully examining the spatial and temporal properties, we propose to disrupt a real video through a Pseudo-fake Generator.
Such practice allows us to achieve deepfake detection without using fake videos and improves the generalization ability in a simple and efficient manner.
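As a toy analogue of the Pseudo-fake Generator idea (the paper's actual disruptions differ; this stand-in only perturbs temporal regularity), one can corrupt a real clip's frame order to synthesize training negatives without any fake videos:

```python
# Create a pseudo-fake by disrupting a real clip's temporal regularity:
# repeat one frame (a stutter) and locally shuffle a short window.
import numpy as np

def disrupt_temporal(clip: np.ndarray, seed: int = 0) -> np.ndarray:
    """clip: (T, H, W, C) real frames -> temporally disrupted pseudo-fake."""
    rng = np.random.default_rng(seed)
    t = clip.shape[0]
    idx = np.arange(t)
    j = int(rng.integers(1, t - 1))
    idx[j] = idx[j - 1]        # frame stutter
    rng.shuffle(idx[j:j + 3])  # local shuffle (in-place on the slice)
    return clip[idx]

pseudo_fake = disrupt_temporal(np.zeros((16, 112, 112, 3), dtype=np.uint8))
```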
arXiv Detail & Related papers (2022-07-21T10:42:34Z)
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to unsupervisedly pre-train feature encoders for temporal action localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
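A rough sketch of the pasting step described above (shapes and the random placement are simplified assumptions; the real method pastes the same regions into two videos and aligns their features):

```python
# Cut a temporal region from a source video and paste it at a random
# position of a destination video, recording the pasted span so the
# pretext task knows where the pseudo action lives.
import numpy as np

def paste_pseudo_action(src, dst, length=8, seed=0):
    """src, dst: (T, H, W, C) arrays. Returns pasted video and span."""
    rng = np.random.default_rng(seed)
    s0 = int(rng.integers(0, src.shape[0] - length + 1))
    d0 = int(rng.integers(0, dst.shape[0] - length + 1))
    out = dst.copy()
    out[d0:d0 + length] = src[s0:s0 + length]
    return out, (d0, d0 + length)

video_a = np.zeros((32, 112, 112, 3), np.uint8)
video_b = np.ones((32, 112, 112, 3), np.uint8)
pasted, span = paste_pseudo_action(video_a, video_b)
```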
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
- Voice-Face Homogeneity Tells Deepfake [56.334968246631725]
Existing detection approaches focus on exploring the specific artifacts in deepfake videos.
We propose to perform the deepfake detection from an unexplored voice-face matching view.
Our model obtains significantly improved performance as compared to other state-of-the-art competitors.
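A hedged sketch of the matching view (the encoders are placeholders; the paper's architecture and decision rule are not reproduced here): embed the voice track and face track separately, then score their consistency, with low similarity suggesting a mismatched identity:

```python
# Score voice-face consistency by cosine similarity between embeddings
# from (placeholder) audio and face encoders.
import torch
import torch.nn.functional as F

def voice_face_score(voice_emb: torch.Tensor, face_emb: torch.Tensor) -> torch.Tensor:
    return F.cosine_similarity(voice_emb, face_emb, dim=-1)

voice = torch.randn(4, 256)  # batch of 4 clip-level voice embeddings
face = torch.randn(4, 256)   # matching face-track embeddings
is_suspect = voice_face_score(voice, face) < 0.2  # assumed threshold
```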
arXiv Detail & Related papers (2022-03-04T09:08:50Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
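One plausible reading of the co-attention formulation (a simplified sketch assuming the low- and high-level maps share channel and spatial dimensions; the paper's exact formulation may differ): build an affinity matrix between the two feature maps and let each attend to the other before fusing:

```python
# Simplified co-attention fusion of low- and high-level feature maps.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttentionFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Linear(channels, channels, bias=False)

    def forward(self, low, high):              # both (B, C, H, W)
        b, c, h, w = low.shape
        L = low.flatten(2).transpose(1, 2)     # (B, HW, C)
        Hf = high.flatten(2).transpose(1, 2)   # (B, HW, C)
        aff = torch.bmm(self.proj(L), Hf.transpose(1, 2))  # (B, HW, HW)
        low_att = torch.bmm(F.softmax(aff, dim=-1), Hf)    # low attends high
        high_att = torch.bmm(F.softmax(aff.transpose(1, 2), dim=-1), L)
        return (low_att + high_att).transpose(1, 2).reshape(b, c, h, w)

fuse = CoAttentionFusion(64)
out = fuse(torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14))  # (2, 64, 14, 14)
```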
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Detection of Deepfake Videos Using Long Distance Attention [73.6659488380372]
Most existing detection methods treat the problem as a vanilla binary classification problem.
In this paper, the problem is treated as a special fine-grained classification problem since the differences between fake and real faces are very subtle.
A spatial-temporal model with two components is proposed to capture spatial and temporal forgery traces from a global perspective.
arXiv Detail & Related papers (2021-06-24T08:33:32Z)
- Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis [69.09526348527203]
Deep generative models have led to highly realistic media, known as deepfakes, that are often indistinguishable from real media to the human eye.
We propose a novel fake detection method that re-synthesizes testing images and extracts visual cues for detection.
We demonstrate the improved effectiveness, cross-GAN generalization, and robustness against perturbations of our approach in a variety of detection scenarios.
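The gist, sketched with an assumed stand-in for the learned re-synthesis stage (the paper uses learned tasks such as super-resolution and denoising): reconstruct the test image and use the residual as the detection cue, since real images tend to re-synthesize faithfully while generated ones leave larger, structured residuals:

```python
# Residual-based cue from a crude re-synthesis stand-in
# (2x downsample then nearest-neighbor upsample).
import numpy as np

def resynthesize(img: np.ndarray) -> np.ndarray:
    small = img[::2, ::2]
    return np.repeat(np.repeat(small, 2, axis=0), 2, axis=1)

def residual_cue(img: np.ndarray) -> float:
    x = img.astype(np.float32)
    return float(np.abs(x - resynthesize(x)).mean())

print(residual_cue(np.random.rand(64, 64, 3)))  # larger residual -> more suspect
```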
arXiv Detail & Related papers (2021-05-29T21:22:24Z)
- Spatio-temporal Features for Generalized Detection of Deepfake Videos [12.453288832098314]
We propose spatio-temporal features, modeled by 3D CNNs, to extend the capability to detect new sorts of deepfake videos.
We show that our approach outperforms existing methods in terms of generalization capabilities.
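A minimal 3D-CNN classifier of the kind the paper builds on (depths and widths here are illustrative only):

```python
# Tiny 3D CNN that maps a clip (B, 3, T, H, W) to a real/fake logit.
import torch
import torch.nn as nn

spatiotemporal_net = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(2),
    nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(64, 1),
)
logit = spatiotemporal_net(torch.randn(2, 3, 16, 112, 112))  # (2, 1)
```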
arXiv Detail & Related papers (2020-10-22T16:28:50Z)
- Deepfake Detection using Spatiotemporal Convolutional Networks [0.0]
Many deepfake detection methods use only individual frames and therefore fail to learn from temporal information.
We created a performance benchmark using the Celeb-DF dataset.
Our methods outperformed state-of-the-art frame-based detection methods.
arXiv Detail & Related papers (2020-06-26T01:32:31Z)