Deepfake Detection Scheme Based on Vision Transformer and Distillation
- URL: http://arxiv.org/abs/2104.01353v1
- Date: Sat, 3 Apr 2021 09:13:05 GMT
- Title: Deepfake Detection Scheme Based on Vision Transformer and Distillation
- Authors: Young-Jin Heo, Young-Ju Choi, Young-Woon Lee, Byung-Gyu Kim
- Abstract summary: We propose a Vision Transformer model with distillation methodology for detecting fake videos.
We verify that the proposed scheme with patch embedding as input outperforms the state-of-the-art model that uses combined CNN features.
- Score: 4.716110829725784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A Deepfake is a manipulated video made with generative deep learning
techniques, such as Generative Adversarial Networks (GANs) or autoencoders, that
anyone can utilize. Recently, with the increase in Deepfake videos, classifiers
built on convolutional neural networks (CNNs) that can distinguish fake videos,
as well as Deepfake datasets, have been actively created. However, previous
CNN-based studies suffer not only from overfitting but also from frequently
misjudging fake videos as real ones. In this paper, we propose a Vision
Transformer model with a distillation methodology for detecting fake videos. We
design the model so that CNN features and a patch-based positioning model learn
to interact across all positions to locate the artifact regions, addressing the
false-negative problem. Through comparative analysis on the Deepfake Detection
Challenge (DFDC) dataset, we verify that the proposed scheme with patch
embeddings as input outperforms the state-of-the-art model that uses combined
CNN features. Without any ensemble technique, our model obtains an AUC of 0.978
and an F1 score of 91.9, while the previous SOTA model yields an AUC of 0.972
and an F1 score of 90.6 under the same conditions.
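The architecture is easier to see in code. Below is a minimal sketch, not the
authors' released implementation, of a DeiT-style Vision Transformer for binary
real/fake classification: patch embeddings plus a distillation token whose head
can be supervised by a CNN teacher. All module names and hyperparameters are
illustrative assumptions.

```python
import torch
import torch.nn as nn

class DistilledViTDetector(nn.Module):
    """ViT-style binary detector with a DeiT-like distillation token (sketch)."""

    def __init__(self, img_size=224, patch_size=16, dim=384, depth=6, heads=6):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: a strided convolution turns each patch into a token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))  # distillation token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 2)       # trained on ground-truth labels
        self.head_dist = nn.Linear(dim, 2)  # trained on a CNN teacher's outputs

    def forward(self, x):
        b = x.size(0)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1),
                            self.dist_token.expand(b, -1, -1),
                            tokens], dim=1)                      # (B, N + 2, dim)
        tokens = self.encoder(tokens + self.pos_embed)
        # Class head reads the CLS token; distillation head reads the dist token.
        return self.head(tokens[:, 0]), self.head_dist(tokens[:, 1])
```

During training, the class head would take a cross-entropy loss against the
ground-truth labels and the distillation head a second loss against the CNN
teacher's predictions; at inference the two heads' logits can be averaged.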
Related papers
- AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- Deepfake Video Detection Using Generative Convolutional Vision Transformer [3.8297637120486496]
We propose a Generative Convolutional Vision Transformer (GenConViT) for deepfake video detection.
Our model combines ConvNeXt and Swin Transformer models for feature extraction.
By learning from the visual artifacts and latent data distribution, GenConViT achieves improved performance in detecting a wide range of deepfake videos.
arXiv Detail & Related papers (2023-07-13T19:27:40Z)
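As a rough sketch of the dual-backbone idea in the GenConViT entry above, one
can concatenate pooled ConvNeXt and Swin Transformer features and classify on
the result. The timm identifiers are real model names, but the pairing and head
are illustrative assumptions, and GenConViT's generative/latent branch is
omitted.

```python
import torch
import torch.nn as nn
import timm  # assumed available

class DualBackboneDetector(nn.Module):
    """Concatenates pooled ConvNeXt and Swin features for real/fake classification."""

    def __init__(self):
        super().__init__()
        # num_classes=0 makes timm return pooled feature vectors instead of logits.
        self.convnext = timm.create_model("convnext_tiny", pretrained=True,
                                          num_classes=0)
        self.swin = timm.create_model("swin_tiny_patch4_window7_224",
                                      pretrained=True, num_classes=0)
        self.head = nn.Linear(self.convnext.num_features + self.swin.num_features, 2)

    def forward(self, x):
        # Both backbones see the same face crop and contribute complementary cues.
        feats = torch.cat([self.convnext(x), self.swin(x)], dim=1)
        return self.head(feats)
```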
- Deep Convolutional Pooling Transformer for Deepfake Detection [54.10864860009834]
We propose a deep convolutional Transformer to incorporate decisive image features both locally and globally.
Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy.
The proposed solution consistently outperforms several state-of-the-art baselines on both within- and cross-dataset experiments.
arXiv Detail & Related papers (2022-09-12T15:05:41Z)
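The re-attention component named above can be sketched as a self-attention
block whose attention maps are re-mixed across heads (in the DeepViT spirit);
the convolutional pooling stage is omitted here, and all dimensions are
assumptions.

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Multi-head self-attention with learnable head mixing (sketch)."""

    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # A 1x1 convolution over the head dimension re-mixes the attention maps.
        self.mix = nn.Conv2d(heads, heads, kernel_size=1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, N, dim)
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.heads, d // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (B, heads, N, head_dim)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        attn = self.mix(attn)                    # re-attention across heads
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```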
- Voice-Face Homogeneity Tells Deepfake [56.334968246631725]
Existing detection approaches focus on exploring the specific artifacts in deepfake videos.
We propose to perform the deepfake detection from an unexplored voice-face matching view.
Our model obtains significantly improved performance as compared to other state-of-the-art competitors.
arXiv Detail & Related papers (2022-03-04T09:08:50Z)
- Model Attribution of Face-swap Deepfake Videos [39.771800841412414]
We first introduce a new dataset of DeepFakes from Different Models (DFDM), based on several autoencoder models.
Specifically, five generation models with variations in encoder, decoder, intermediate layer, input resolution, and compression ratio have been used to generate a total of 6,450 Deepfake videos.
We treat Deepfake model attribution as a multiclass classification task and propose a spatial- and temporal-attention-based method to explore the differences among Deepfakes.
arXiv Detail & Related papers (2022-02-25T20:05:18Z)
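A hypothetical sketch of the temporal half of such an attention-based
attributor: per-frame CNN features are attention-pooled into a video vector and
classified into one of the five generation models. The feature extractor and
sizes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TemporalAttentionAttributor(nn.Module):
    """Attention-pools frame features and predicts the source generation model."""

    def __init__(self, feat_dim=512, num_models=5):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)               # attention score per frame
        self.classifier = nn.Linear(feat_dim, num_models)

    def forward(self, frame_feats):                       # (B, T, feat_dim)
        weights = self.score(frame_feats).softmax(dim=1)  # (B, T, 1)
        video_feat = (weights * frame_feats).sum(dim=1)   # weighted temporal mean
        return self.classifier(video_feat)
```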
- Combining EfficientNet and Vision Transformers for Video Deepfake Detection [6.365889364810238]
Deepfakes are the result of digital manipulation that produces credible videos intended to deceive the viewer.
In this study, we combine various types of Vision Transformers with a convolutional EfficientNet B0 used as a feature extractor.
The best model achieved an AUC of 0.951 and an F1 score of 88.0%, very close to the state-of-the-art on the DeepFake Detection Challenge (DFDC) dataset.
arXiv Detail & Related papers (2021-07-06T13:35:11Z)
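A minimal sketch of the hybrid pattern in the entry above: EfficientNet-B0
feature-map cells serve as tokens for a small Transformer encoder. The paper
evaluates several ViT variants, so this is an illustrative assumption, not its
exact design.

```python
import torch
import torch.nn as nn
import timm  # assumed available

class EfficientNetViTHybrid(nn.Module):
    """EfficientNet-B0 feature maps feed a Transformer encoder (sketch)."""

    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        # features_only=True returns intermediate feature maps instead of logits.
        self.cnn = timm.create_model("efficientnet_b0", pretrained=True,
                                     features_only=True)
        cnn_dim = self.cnn.feature_info.channels()[-1]  # channels of the last map
        self.proj = nn.Linear(cnn_dim, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 2)

    def forward(self, x):
        fmap = self.cnn(x)[-1]                                # (B, C, H, W)
        tokens = self.proj(fmap.flatten(2).transpose(1, 2))   # each cell is a token
        tokens = torch.cat([self.cls_token.expand(x.size(0), -1, -1), tokens], dim=1)
        # Positional embeddings are omitted for brevity; a real model would add them.
        return self.head(self.encoder(tokens)[:, 0])
```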
- Deepfake Video Detection Using Convolutional Vision Transformer [0.0]
Deep learning techniques can generate and synthesize hyper-realistic videos known as Deepfakes.
Deepfakes pose a looming threat to everyone if used for harmful purposes such as identity theft, phishing, and scams.
We propose a Convolutional Vision Transformer for the detection of Deepfakes.
arXiv Detail & Related papers (2021-02-22T15:56:05Z)
- Adversarially robust deepfake media detection using fused convolutional neural network predictions [79.00202519223662]
Current deepfake detection systems struggle against unseen data.
We employ three different deep Convolutional Neural Network (CNN) models to classify fake and real images extracted from videos.
The proposed technique outperforms state-of-the-art models with 96.5% accuracy.
arXiv Detail & Related papers (2021-02-11T11:28:00Z)
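Late fusion of several CNNs, as in the entry above, can be as simple as
averaging softmax outputs. The backbone choices below are illustrative, not
necessarily the ones used in the paper.

```python
import torch
import timm  # assumed available

def fused_prediction(models, face):
    """Average the softmax outputs of several binary classifiers (late fusion)."""
    probs = [model(face).softmax(dim=-1) for model in models]
    return torch.stack(probs).mean(dim=0)  # (B, 2) fused real/fake probabilities

# Three heterogeneous CNNs, each assumed fine-tuned as a real/fake classifier.
models = [timm.create_model(name, pretrained=True, num_classes=2).eval()
          for name in ("resnet50", "densenet121", "efficientnet_b0")]
with torch.no_grad():
    fused = fused_prediction(models, torch.randn(4, 3, 224, 224))
```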
- Sharp Multiple Instance Learning for DeepFake Video Detection [54.12548421282696]
We introduce a new problem of partial face attack in DeepFake video, where only video-level labels are provided but not all the faces in the fake videos are manipulated.
A sharp MIL (S-MIL) is proposed, which builds a direct mapping from instance embeddings to the bag prediction.
Experiments on FFPMS and the widely used DFDC dataset verify that S-MIL is superior to other counterparts for partially attacked DeepFake video detection.
arXiv Detail & Related papers (2020-08-11T08:52:17Z)
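A sketch of the MIL idea in the entry above: per-face (instance) scores are
aggregated so that a few strongly manipulated faces dominate the video-level
(bag) prediction. The sharpened weighting below is illustrative; the actual
S-MIL formulation differs in detail.

```python
import torch
import torch.nn as nn

class SharpMILHead(nn.Module):
    """Maps per-face instance embeddings directly to a bag (video) score."""

    def __init__(self, feat_dim=512, temperature=5.0):
        super().__init__()
        self.instance_score = nn.Linear(feat_dim, 1)
        self.temperature = temperature

    def forward(self, instances):                # (B, N, feat_dim), N face crops
        scores = self.instance_score(instances).squeeze(-1).sigmoid()  # (B, N)
        # Sharp aggregation: softmax weights emphasize the most suspicious faces,
        # so a few manipulated faces can flip the whole video's prediction.
        weights = (self.temperature * scores).softmax(dim=-1)
        return (weights * scores).sum(dim=-1)    # (B,) bag fake probability
```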
- Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for distinguishing real from fake (deepfake) multimedia content.
We extract and analyze the similarity between the audio and visual modalities within the same video.
We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)
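A minimal sketch of the modality-agreement idea in the entry above: when
affective embeddings extracted from a video's audio and visual tracks disagree,
the video is scored as likely fake. The encoders and the mapping to a fakeness
score are assumptions.

```python
import torch
import torch.nn.functional as F

def modality_mismatch_score(audio_emb, visual_emb):
    """Low audio-visual (affective) similarity suggests a manipulated video."""
    sim = F.cosine_similarity(audio_emb, visual_emb, dim=-1)  # in [-1, 1]
    return 1.0 - (sim + 1.0) / 2.0  # map similarity to a [0, 1] fakeness score

# Hypothetical per-video embeddings from pretrained emotion encoders (assumed).
audio_emb = torch.randn(8, 256)   # speech-emotion embeddings
visual_emb = torch.randn(8, 256)  # facial-expression embeddings
fakeness = modality_mismatch_score(audio_emb, visual_emb)  # (8,) scores
```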