Unmasking Deepfakes: Masked Autoencoding Spatiotemporal Transformers for
Enhanced Video Forgery Detection
- URL: http://arxiv.org/abs/2306.06881v2
- Date: Fri, 9 Feb 2024 12:25:03 GMT
- Title: Unmasking Deepfakes: Masked Autoencoding Spatiotemporal Transformers for
Enhanced Video Forgery Detection
- Authors: Sayantan Das, Mojtaba Kolahdouzi, Levent Özparlak, Will Hickie, Ali Etemad
- Abstract summary: We present a novel approach for the detection of deepfake videos using a pair of vision transformers pre-trained by a self-supervised masked autoencoding setup.
Our method consists of two distinct components, one of which focuses on learning spatial information from individual RGB frames of the video, while the other learns temporal consistency information from optical flow fields generated from consecutive frames.
- Score: 19.432851794777754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel approach for the detection of deepfake videos using a pair
of vision transformers pre-trained by a self-supervised masked autoencoding
setup. Our method consists of two distinct components, one of which focuses on
learning spatial information from individual RGB frames of the video, while the
other learns temporal consistency information from optical flow fields
generated from consecutive frames. Unlike most approaches where pre-training is
performed on a generic large corpus of images, we show that by pre-training on
smaller face-related datasets, namely Celeb-A (for the spatial learning
component) and YouTube Faces (for the temporal learning component), strong
results can be obtained. We perform various experiments to evaluate the
performance of our method on commonly used datasets, namely FaceForensics++ (Low
Quality and High Quality, along with a new highly compressed version named Very
Low Quality) and Celeb-DFv2. Our experiments show that our method sets
a new state-of-the-art on FaceForensics++ (LQ, HQ, and VLQ), and obtains
competitive results on Celeb-DFv2. Moreover, our method outperforms other
methods in the area in a cross-dataset setup where we fine-tune our model on
FaceForensics++ and test on Celeb-DFv2, pointing to its strong cross-dataset
generalization ability.
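The masked autoencoding pre-training mentioned in the abstract can be illustrated with a minimal sketch of the random patch-masking step. This is not the authors' implementation: the function names `patchify` and `random_mask_patches`, the 16-pixel patch size, the 224x224 input size, and the 0.75 mask ratio are all assumptions borrowed from common masked-autoencoder practice, not values stated in the abstract. The same routine applies to a 3-channel RGB frame (spatial branch) or a 2-channel optical flow field (temporal branch).

```python
import numpy as np

def patchify(frame, patch=16):
    """Cut an (H, W, C) array into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C).
    The patch size of 16 is an assumed default, not taken from the paper.
    """
    h, w, c = frame.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of patch"
    gh, gw = h // patch, w // patch
    x = frame.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def random_mask_patches(patches, mask_ratio=0.75, rng=None):
    """Randomly split patch tokens into visible and masked sets.

    Only the visible patches would be fed to the encoder; the masked
    indices mark what a decoder would be asked to reconstruct.
    The 0.75 mask ratio is an assumed default, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches = patches.shape[0]
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    perm = rng.permutation(num_patches)
    visible_idx = np.sort(perm[:num_keep])
    masked_idx = np.sort(perm[num_keep:])
    return patches[visible_idx], visible_idx, masked_idx

# Hypothetical inputs: one RGB frame and one optical flow field.
rng = np.random.default_rng(0)
rgb_frame = rng.random((224, 224, 3))
flow_field = rng.random((224, 224, 2))

rgb_visible, rgb_vis_idx, rgb_mask_idx = random_mask_patches(patchify(rgb_frame), rng=rng)
flow_visible, flow_vis_idx, flow_mask_idx = random_mask_patches(patchify(flow_field), rng=rng)
```

With these assumed settings each 224x224 input yields 196 patches, of which 49 remain visible and 147 are masked; a real pipeline would pass the visible patches through a transformer encoder and train a decoder to reconstruct the masked ones.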
Related papers
- Pre-training for Action Recognition with Automatically Generated Fractal Datasets [23.686476742398973]
We present methods to automatically produce large-scale datasets of short synthetic video clips.
The generated video clips are characterized by notable variety, stemming from the innate ability of fractals to generate complex multi-scale structures.
Compared to standard Kinetics pre-training, our reported results come close and are even superior on a portion of downstream datasets.
arXiv Detail & Related papers (2024-11-26T16:51:11Z)
- UniForensics: Face Forgery Detection via General Facial Representation [60.5421627990707]
High-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization.
We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- Weakly Supervised Two-Stage Training Scheme for Deep Video Fight Detection Model [0.0]
Fight detection in videos is an emerging deep learning application with today's prevalence of surveillance systems and streaming media.
Previous work has largely relied on action recognition techniques to tackle this problem.
We design the fight detection model as a composition of an action-aware feature extractor and an anomaly score generator.
arXiv Detail & Related papers (2022-09-23T08:29:16Z)
- Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z)
- Deep Convolutional Pooling Transformer for Deepfake Detection [54.10864860009834]
We propose a deep convolutional Transformer to incorporate decisive image features both locally and globally.
Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy.
The proposed solution consistently outperforms several state-of-the-art baselines on both within- and cross-dataset experiments.
arXiv Detail & Related papers (2022-09-12T15:05:41Z)
- Self-supervised Video-centralised Transformer for Video Face Clustering [58.12996668434134]
This paper presents a novel method for face clustering in videos using a video-centralised transformer.
We release the first large-scale egocentric video face clustering dataset named EasyCom-Clustering.
arXiv Detail & Related papers (2022-03-24T16:38:54Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
- Two-branch Recurrent Network for Isolating Deepfakes in Videos [17.59209853264258]
We present a method for deepfake detection based on a two-branch network structure.
One branch propagates the original information, while the other branch suppresses the face content.
Our two novel components show promising results on the FaceForensics++, Celeb-DF, and Facebook's DFDC preview benchmarks.
arXiv Detail & Related papers (2020-08-08T01:38:56Z)