Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection
- URL: http://arxiv.org/abs/2510.10663v1
- Date: Sun, 12 Oct 2025 15:38:03 GMT
- Title: Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection
- Authors: Gaojian Wang, Feng Lin, Tong Wu, Zhisheng Yan, Kui Ren
- Abstract summary: We make the first attempt and propose FS-VFM to learn fundamental representations of real face images. We introduce three learning objectives, namely 3C, that synergize masked image modeling (MIM) and instance discrimination (ID). We present a reliable self-distillation mechanism that seamlessly couples MIM with ID to establish underlying local-to-global correspondence. Experiments on 11 public benchmarks demonstrate that our FS-VFM consistently generalizes better than diverse VFMs.
- Score: 23.328598687742712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With abundant, unlabeled real faces, how can we learn robust and transferable facial representations to boost generalization across various face security tasks? We make the first attempt and propose FS-VFM, a scalable self-supervised pre-training framework, to learn fundamental representations of real face images. We introduce three learning objectives, namely 3C, that synergize masked image modeling (MIM) and instance discrimination (ID), empowering FS-VFM to encode both local patterns and global semantics of real faces. Specifically, we formulate various facial masking strategies for MIM and devise a simple yet effective CRFR-P masking, which explicitly prompts the model to pursue meaningful intra-region Consistency and challenging inter-region Coherency. We present a reliable self-distillation mechanism that seamlessly couples MIM with ID to establish underlying local-to-global Correspondence. After pre-training, vanilla vision transformers (ViTs) serve as universal Vision Foundation Models for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forensics. To efficiently transfer the pre-trained FS-VFM, we further propose FS-Adapter, a lightweight plug-and-play bottleneck atop the frozen backbone with a novel real-anchor contrastive objective. Extensive experiments on 11 public benchmarks demonstrate that our FS-VFM consistently generalizes better than diverse VFMs, spanning natural and facial domains, fully, weakly, and self-supervised paradigms, and small, base, and large ViT scales, and even outperforms SOTA task-specific methods, while FS-Adapter offers an excellent efficiency-performance trade-off. The code and models are available at https://fsfm-3c.github.io/fsvfm.html.
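The abstract describes FS-Adapter as a lightweight plug-and-play bottleneck atop a frozen backbone. The sketch below illustrates the general bottleneck-adapter pattern that description suggests (down-project, nonlinearity, up-project, residual add); the dimensions, zero-initialized up-projection, and function names are illustrative assumptions, not details from the paper, and the real-anchor contrastive objective is not modeled here. Pure Python is used so the example is self-contained.

```python
# Hypothetical bottleneck-adapter sketch in the spirit of FS-Adapter.
# All names and dimensions are illustrative, not from the paper.

def linear(x, w, b):
    """y = x @ w + b for a single feature vector x (w is dim_in x dim_out)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]

def bottleneck_adapter(feat, w_down, b_down, w_up, b_up):
    """Adapt a frozen-backbone feature: down-project, ReLU, up-project, residual."""
    h = [max(0.0, v) for v in linear(feat, w_down, b_down)]   # bottleneck
    out = linear(h, w_up, b_up)                               # back to full dim
    return [f + o for f, o in zip(feat, out)]                 # residual add

# Toy setup: 4-dim feature, 2-dim bottleneck. The up-projection is
# zero-initialized (a common adapter trick), so before any training the
# adapter is an identity map and cannot hurt the frozen backbone.
dim, bott = 4, 2
w_down = [[0.1] * bott for _ in range(dim)]
b_down = [0.0] * bott
w_up = [[0.0] * dim for _ in range(bott)]   # zero-init up-projection
b_up = [0.0] * dim

feat = [1.0, 2.0, 3.0, 4.0]
adapted = bottleneck_adapter(feat, w_down, b_down, w_up, b_up)
# With zero-init w_up, adapted == feat; training the tiny adapter then
# specializes the frozen features for the downstream face-security task.
```

The design choice mirrored here is why such adapters are attractive for transfer: only the small down/up projections are trained, so the cost scales with the bottleneck width rather than the backbone size.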
Related papers
- FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models [51.858371492494456]
Face anti-spoofing (FAS) is crucial for protecting facial recognition systems from presentation attacks. There is currently no universal and comprehensive MLLM and dataset specifically designed for the FAS task. We propose FaceShield, an MLLM for FAS, along with the corresponding pre-training and supervised fine-tuning datasets. Our instruction datasets, protocols, and codes will be released soon.
arXiv Detail & Related papers (2025-05-14T14:10:43Z)
- Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models [58.936893810674896]
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. We introduce a multimodal large language model framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS). We propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images.
arXiv Detail & Related papers (2025-01-03T09:25:04Z)
- FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning [27.34249750803211]
We propose a self-supervised pretraining framework to learn fundamental representations of real face images. Our model transfers better than supervised pretraining and visual and facial self-supervised learning arts, and even outperforms task-specialized SOTA methods.
arXiv Detail & Related papers (2024-12-16T17:58:45Z)
- ID$^3$: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition [60.15830516741776]
Synthetic face recognition (SFR) aims to generate datasets that mimic the distribution of real face data.
We introduce a diffusion-fueled SFR model termed ID$^3$.
ID$^3$ employs an ID-preserving loss to generate diverse yet identity-consistent facial appearances.
arXiv Detail & Related papers (2024-09-26T06:46:40Z)
- MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection [64.29452783056253]
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia. Although existing approaches mainly capture face forgery patterns using the image modality, other modalities like fine-grained noises and texts are not fully explored. We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z)
- FaceCat: Enhancing Face Recognition Security with a Unified Diffusion Model [30.0523477092216]
Face anti-spoofing (FAS) and adversarial detection (FAD) have been regarded as critical technologies to ensure the safety of face recognition systems.
This paper aims to achieve this goal by breaking through two primary obstacles: 1) the suboptimal face feature representation and 2) the scarcity of training data.
arXiv Detail & Related papers (2024-04-14T09:01:26Z)
- FLIP: Cross-domain Face Anti-spoofing with Language Guidance [19.957293190322332]
Face anti-spoofing (FAS) or presentation attack detection is an essential component of face recognition systems.
Recent vision transformer (ViT) models have been shown to be effective for the FAS task.
We propose a novel approach for robust cross-domain FAS by grounding visual representations with the help of natural language.
arXiv Detail & Related papers (2023-09-28T17:53:20Z)
- TransFace++: Rethinking the Face Recognition Paradigm with a Focus on Accuracy, Efficiency, and Security [56.24794071698785]
Face Recognition (FR) technology has made significant strides with the emergence of deep learning. Most existing FR models are built upon Convolutional Neural Networks (CNNs) and take RGB face images as the model's input. We propose two novel FR frameworks, i.e., TransFace and TransFace++, which successfully explore the feasibility of applying ViTs and image bytes to FR tasks.
arXiv Detail & Related papers (2023-08-20T02:02:16Z)
- DotFAN: A Domain-transferred Face Augmentation Network for Pose and Illumination Invariant Face Recognition [94.96686189033869]
We propose a 3D model-assisted domain-transferred face augmentation network (DotFAN)
DotFAN can generate a series of variants of an input face based on knowledge distilled from rich face datasets collected from other domains.
Experiments show that DotFAN is beneficial for augmenting small face datasets to improve their within-class diversity.
arXiv Detail & Related papers (2020-02-23T08:16:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.