Self-supervised Transformer for Deepfake Detection
- URL: http://arxiv.org/abs/2203.01265v1
- Date: Wed, 2 Mar 2022 17:44:40 GMT
- Title: Self-supervised Transformer for Deepfake Detection
- Authors: Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Weiming Zhang and Nenghai Yu
- Abstract summary: Deepfake techniques in real-world scenarios require stronger generalization abilities of face forgery detectors.
Inspired by transfer learning, neural networks pre-trained on other large-scale face-related tasks may provide useful features for deepfake detection.
In this paper, we propose a self-supervised, transformer-based audio-visual contrastive learning method.
- Score: 112.81127845409002
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The fast evolution and widespread use of deepfake techniques in real-world
scenarios require stronger generalization abilities from face forgery detectors.
Some works capture features that are not specific to any single forgery method,
such as blending-boundary clues and accumulated up-sampling traces, to
strengthen generalization. However, the effectiveness of these methods is
easily degraded by post-processing operations such as compression. Inspired by
transfer learning, neural networks pre-trained on other large-scale
face-related tasks may provide useful features for deepfake detection. For
example, lip movement has been shown to be a robust and highly transferable
high-level semantic feature, which can be learned from the lipreading task.
However, the existing method pre-trains the lip feature extraction model in a
supervised manner, which requires substantial human effort for data annotation
and makes training data harder to obtain. In this paper, we propose a
self-supervised, transformer-based audio-visual contrastive learning method.
The proposed method learns mouth motion representations by encouraging paired
video and audio representations to be close while pushing unpaired ones apart.
After pre-training with our method, the model is partially fine-tuned for the
deepfake detection task. Extensive experiments show that our self-supervised
method performs comparably to, or even better than, its supervised
pre-training counterpart.
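The pairing objective in the abstract (paired video/audio representations pulled close, unpaired ones pushed apart) is commonly implemented as a symmetric InfoNCE loss over a batch. A minimal NumPy sketch of that idea, under the assumption of batch-aligned embeddings; the function name, batch layout, and temperature value are illustrative, not taken from the paper:

```python
import numpy as np

def info_nce(video_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of `video_emb` is paired with row i of `audio_emb` (the
    positive); every other row in the batch serves as a negative.
    """
    # L2-normalize so the dot product is cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature          # (B, B) similarity matrix
    idx = np.arange(len(logits))            # positives lie on the diagonal

    def xent(l):
        # numerically stable log-softmax cross-entropy on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()

    # average the video->audio and audio->video directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

When the two modalities are aligned, the diagonal dominates and the loss approaches zero; shuffled or unpaired inputs yield a loss near log(B).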
Related papers
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a major issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z) - Adversarially Robust Deepfake Detection via Adversarial Feature Similarity Learning [0.0]
Deepfake technology has raised concerns about the authenticity of digital content, necessitating the development of effective detection methods.
Adversaries can manipulate deepfake videos with small, imperceptible perturbations that can deceive the detection models into producing incorrect outputs.
We introduce Adversarial Feature Similarity Learning (AFSL), which integrates three fundamental deep feature learning paradigms.
arXiv Detail & Related papers (2024-02-06T11:35:05Z) - Segue: Side-information Guided Generative Unlearnable Examples for
Facial Privacy Protection in Real World [64.4289385463226]
We propose Segue: Side-information guided generative unlearnable examples.
To improve transferability, we introduce side information such as true labels and pseudo labels.
It can resist JPEG compression, adversarial training, and some standard data augmentations.
arXiv Detail & Related papers (2023-10-24T06:22:37Z) - ALSO: Automotive Lidar Self-supervision by Occupancy estimation [70.70557577874155]
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds.
The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled.
The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information.
arXiv Detail & Related papers (2022-12-12T13:10:19Z) - FakeOut: Leveraging Out-of-domain Self-supervision for Multi-modal Video
Deepfake Detection [10.36919027402249]
Synthetic videos of speaking humans can be used to spread misinformation in a convincing manner.
FakeOut is a novel approach that relies on multi-modal data throughout both the pre-training phase and the adaptation phase.
Our method achieves state-of-the-art results in cross-dataset generalization on audio-visual datasets.
arXiv Detail & Related papers (2022-12-01T18:56:31Z) - Deepfake Detection via Joint Unsupervised Reconstruction and Supervised
Classification [25.84902508816679]
We introduce a novel approach for deepfake detection, which considers the reconstruction and classification tasks simultaneously.
This method shares the information learned by one task with the other, an aspect that existing works rarely consider.
Our method achieves state-of-the-art performance on three commonly-used datasets.
arXiv Detail & Related papers (2022-11-24T05:44:26Z) - DeepfakeUCL: Deepfake Detection via Unsupervised Contrastive Learning [20.94569893388119]
We design a novel deepfake detection method via unsupervised contrastive learning.
We show that our method enables comparable detection performance to state-of-the-art supervised techniques.
arXiv Detail & Related papers (2021-04-23T09:48:10Z) - Towards Generalizable and Robust Face Manipulation Detection via
Bag-of-local-feature [55.47546606878931]
We propose a novel method for face manipulation detection, which can improve the generalization ability and robustness by bag-of-local-feature.
Specifically, we extend Transformers using bag-of-feature approach to encode inter-patch relationships, allowing it to learn local forgery features without any explicit supervision.
arXiv Detail & Related papers (2021-03-14T12:50:48Z) - Lips Don't Lie: A Generalisable and Robust Approach to Face Forgery
Detection [118.37239586697139]
LipForensics is a detection approach capable of both generalising manipulations and withstanding various distortions.
It consists of first pre-training a spatio-temporal network to perform visual speech recognition (lipreading).
A temporal network is subsequently fine-tuned on fixed mouth embeddings of real and forged data to detect fake videos based on mouth movements, without over-fitting to low-level, manipulation-specific artefacts.
arXiv Detail & Related papers (2020-12-14T15:53:56Z)
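Both LipForensics above and the main paper follow the same partial fine-tuning recipe: a pre-trained feature extractor is kept fixed while a small head is trained for detection. A minimal NumPy sketch of that idea as logistic regression on frozen features; all names and hyperparameters are illustrative, not taken from either paper:

```python
import numpy as np

def finetune_head(frozen_features, labels, lr=0.5, steps=200):
    """Train only a linear classification head on top of frozen,
    pre-trained features (logistic regression via gradient descent).

    The backbone that produced `frozen_features` is never updated;
    only the head parameters (w, b) are learned.
    """
    n, d = frozen_features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = frozen_features @ w + b
        p = 1.0 / (1.0 + np.exp(-logits))   # sigmoid probabilities
        grad = p - labels                   # d(BCE loss)/d(logits)
        w -= lr * frozen_features.T @ grad / n
        b -= lr * grad.mean()
    return w, b
```

In practice the embeddings would come from the self-supervised backbone; keeping that backbone fixed is what prevents the head from over-fitting to low-level, manipulation-specific artefacts.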
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.