Forgery-aware Adaptive Vision Transformer for Face Forgery Detection
- URL: http://arxiv.org/abs/2309.11092v1
- Date: Wed, 20 Sep 2023 06:51:11 GMT
- Title: Forgery-aware Adaptive Vision Transformer for Face Forgery Detection
- Authors: Anwei Luo, Rizhao Cai, Chenqi Kong, Xiangui Kang, Jiwu Huang and Alex C. Kot
- Abstract summary: We propose a Forgery-aware Adaptive Vision Transformer (FA-ViT).
In FA-ViT, the vanilla ViT's parameters are frozen to preserve its pre-trained knowledge.
Two specially designed components, the Local-aware Forgery Injector (LFI) and the Global-aware Forgery Adaptor (GFA), are employed to adapt forgery-related knowledge.
- Score: 57.56537940216884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advancement in face manipulation technologies, the importance of
face forgery detection in protecting authentication integrity becomes
increasingly evident. Previous Vision Transformer (ViT)-based detectors have
demonstrated subpar performance in cross-database evaluations, primarily
because full fine-tuning on limited Deepfake data often leads to forgetting
pre-trained knowledge and over-fitting to data-specific patterns. To circumvent
these issues, we propose a novel Forgery-aware Adaptive Vision Transformer
(FA-ViT). In FA-ViT, the vanilla ViT's parameters are frozen to preserve its
pre-trained knowledge, while two specially designed components, the Local-aware
Forgery Injector (LFI) and the Global-aware Forgery Adaptor (GFA), are employed
to adapt forgery-related knowledge. Our proposed FA-ViT effectively combines
these two different types of knowledge to form general forgery features for
detecting Deepfakes. Specifically, LFI captures local discriminative
information and injects it into the ViT via Neighborhood-Preserving Cross
Attention (NPCA). Simultaneously, GFA learns adaptive knowledge in the
self-attention layer, bridging the gap between the two different domains.
Furthermore, we design a novel Single Domain Pairwise
Learning (SDPL) to facilitate fine-grained information learning in FA-ViT. The
extensive experiments demonstrate that our FA-ViT achieves state-of-the-art
performance in cross-dataset evaluation and cross-manipulation scenarios, and
improves the robustness against unseen perturbations.
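A minimal PyTorch sketch of the frozen-backbone adaptation idea described in the abstract follows. The bottleneck adapter (standing in for GFA), the plain cross-attention injector (standing in for the NPCA-based LFI), and all module names and sizes are illustrative assumptions, not the authors' released implementation:

```python
# Sketch of frozen ViT block + trainable adapters (assumed design).
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """GFA-style adapter: a low-rank residual branch that adapts the
    frozen block's output toward the forgery domain."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual adaptation

class LocalInjector(nn.Module):
    """LFI-style injector: ViT tokens attend to local CNN features via
    plain cross-attention (the paper's NPCA additionally restricts
    attention to spatial neighborhoods)."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vit_tokens, local_feats):
        out, _ = self.cross_attn(vit_tokens, local_feats, local_feats)
        return vit_tokens + out                       # inject local cues

class AdaptedBlock(nn.Module):
    """One transformer block: only the adapter and injector get gradients."""
    def __init__(self, frozen_block, dim=768):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False                   # preserve pre-trained knowledge
        self.gfa = BottleneckAdapter(dim)
        self.lfi = LocalInjector(dim)

    def forward(self, x, local_feats):
        x = self.gfa(self.block(x))
        return self.lfi(x, local_feats)

# Usage with placeholder tensors and a stock encoder layer as the frozen block.
block = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
layer = AdaptedBlock(block)
tokens = torch.randn(2, 197, 768)     # [CLS] + 14x14 patch tokens
local = torch.randn(2, 196, 768)      # projected local CNN features
out = layer(tokens, local)            # (2, 197, 768)
```

Freezing the backbone while routing gradients only through the two lightweight branches is what lets the adapted model keep its pre-trained knowledge while still learning forgery-specific cues.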
Related papers
- Exploring Self-Supervised Vision Transformers for Deepfake Detection: A Comparative Analysis [38.074487843137064]
This paper investigates the effectiveness of self-supervised pre-trained transformers for deepfake detection.
We focus on their potential for improved generalization, particularly when training data is limited.
We observe comparable adaptability to the task, along with natural explainability of the detection results via the attention mechanism.
arXiv Detail & Related papers (2024-05-01T07:16:49Z)
- Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection [106.39544368711427]
We study the problem of generalizable synthetic image detection, aiming to detect forgery images from diverse generative methods.
We present a novel forgery-aware adaptive transformer approach, namely FatFormer.
Our approach, tuned on 4-class ProGAN data, attains an average of 98% accuracy on unseen GANs and surprisingly generalizes to unseen diffusion models with 95% accuracy.
arXiv Detail & Related papers (2023-12-27T17:36:32Z)
- FLIP: Cross-domain Face Anti-spoofing with Language Guidance [19.957293190322332]
Face anti-spoofing (FAS) or presentation attack detection is an essential component of face recognition systems.
Recent vision transformer (ViT) models have been shown to be effective for the FAS task.
We propose a novel approach for robust cross-domain FAS by grounding visual representations with the help of natural language.
arXiv Detail & Related papers (2023-09-28T17:53:20Z)
- S-Adapter: Generalizing Vision Transformer for Face Anti-Spoofing with Statistical Tokens [45.06704981913823]
Face Anti-Spoofing (FAS) aims to detect malicious attempts to invade a face recognition system by presenting spoofed faces.
We propose a novel Statistical Adapter (S-Adapter) that gathers local discriminative and statistical information from localized token histograms.
To further improve the generalization of the statistical tokens, we propose a novel Token Style Regularization (TSR).
Our experimental results demonstrate that our proposed S-Adapter and TSR provide significant benefits in both zero-shot and few-shot cross-domain testing, outperforming state-of-the-art methods on several benchmark tests.
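As a rough illustration of gathering statistical information from localized tokens, the sketch below computes a soft (differentiable) histogram per token; the bin count, the Gaussian soft-binning, and all sizes are assumptions for illustration rather than the actual S-Adapter design:

```python
# Hypothetical soft token histogram: differentiable binning of token features.
import torch
import torch.nn as nn

class SoftTokenHistogram(nn.Module):
    """Summarize each token's feature vector as a soft histogram so that
    binning stays differentiable for end-to-end training."""
    def __init__(self, num_bins=8):
        super().__init__()
        # learnable bin centers and width over normalized feature values
        self.centers = nn.Parameter(torch.linspace(-1.0, 1.0, num_bins))
        self.width = nn.Parameter(torch.tensor(0.5))

    def forward(self, tokens):               # tokens: (B, N, C)
        x = torch.tanh(tokens)                # squash features into [-1, 1]
        # distance of every feature value to every bin center
        d = x.unsqueeze(-1) - self.centers    # (B, N, C, bins)
        w = torch.exp(-(d ** 2) / (2 * self.width ** 2 + 1e-6))
        hist = w.sum(dim=2)                   # pool over channels
        return hist / hist.sum(dim=-1, keepdim=True).clamp(min=1e-6)

tokens = torch.randn(2, 196, 768)             # patch tokens from a ViT
hist = SoftTokenHistogram()(tokens)           # (2, 196, 8) statistical tokens
```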
arXiv Detail & Related papers (2023-09-07T22:36:22Z)
- TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective [40.521854111639094]
Vision Transformers (ViTs) have demonstrated powerful representation ability in various visual tasks thanks to their intrinsic data-hungry nature.
However, we unexpectedly find that ViTs perform vulnerably when applied to face recognition (FR) scenarios with extremely large datasets.
This paper proposes a superior FR model called TransFace, which employs a patch-level data augmentation strategy named DPAP and a hard sample mining strategy named EHSM.
arXiv Detail & Related papers (2023-08-20T02:02:16Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection(VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN using ViTs significantly surpasses other few-shot learning frameworks with ViTs and is the first to outperform CNN-based state-of-the-art methods.
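A hypothetical sketch of the location-specific supervision idea: a pre-trained teacher ViT produces a soft target for every patch token, and each student token is guided individually. The shared token-level head, temperature, and sizes are placeholder assumptions:

```python
# Per-token supervision in the spirit of SUN (assumed formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F

def per_token_supervision(student_tokens, teacher_tokens, head, tau=4.0):
    """KL divergence between per-patch class distributions,
    averaged over all token locations."""
    s_logits = head(student_tokens)            # (B, N, num_classes)
    with torch.no_grad():
        t_logits = head(teacher_tokens)        # fixed teacher targets
    s_logp = F.log_softmax(s_logits / tau, dim=-1)
    t_prob = F.softmax(t_logits / tau, dim=-1)
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * tau ** 2

head = nn.Linear(768, 64)                  # shared token-level classifier
student = torch.randn(2, 196, 768, requires_grad=True)
teacher = torch.randn(2, 196, 768)         # tokens from the pre-trained ViT
loss = per_token_supervision(student, teacher, head)
loss.backward()
```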
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation [54.61786380919243]
Unsupervised domain adaptation (UDA) aims to transfer the knowledge learnt from a labeled source domain to an unlabeled target domain.
Previous work is mainly built upon convolutional neural networks (CNNs) to learn domain-invariant representations.
Despite the recent surge in applying Vision Transformers (ViTs) to vision tasks, the capability of ViTs to adapt cross-domain knowledge remains unexplored in the literature.
arXiv Detail & Related papers (2021-08-12T22:37:43Z)
- Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition [131.6328804788164]
We propose a framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in the vision-sensor modality (videos).
The SAKDN uses multiple wearable-sensors as teacher modalities and uses RGB videos as student modality.
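A minimal sketch of this multi-teacher, cross-modal distillation setup: several sensor teachers supervise one video student through a softened KL objective. The averaged loss and all shapes are illustrative assumptions, not the paper's exact objective:

```python
# Multi-teacher knowledge distillation sketch (assumed loss).
import torch
import torch.nn.functional as F

def multi_teacher_kd(student_logits, teacher_logits_list, tau=2.0):
    """Average the softened KL divergence against every sensor teacher."""
    s_logp = F.log_softmax(student_logits / tau, dim=-1)
    losses = []
    for t_logits in teacher_logits_list:
        t_prob = F.softmax(t_logits.detach() / tau, dim=-1)
        losses.append(F.kl_div(s_logp, t_prob, reduction="batchmean"))
    return torch.stack(losses).mean() * tau ** 2

student_logits = torch.randn(4, 10, requires_grad=True)   # from RGB video
teachers = [torch.randn(4, 10) for _ in range(3)]          # wearable sensors
loss = multi_teacher_kd(student_logits, teachers)
loss.backward()
```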
arXiv Detail & Related papers (2020-09-01T03:38:31Z)