G$^2$V$^2$former: Graph Guided Video Vision Transformer for Face Anti-Spoofing
- URL: http://arxiv.org/abs/2408.07675v1
- Date: Wed, 14 Aug 2024 17:22:41 GMT
- Title: G$^2$V$^2$former: Graph Guided Video Vision Transformer for Face Anti-Spoofing
- Authors: Jingyi Yang, Zitong Yu, Xiuming Ni, Jia He, Hui Li,
- Abstract summary: In videos containing spoofed faces, we may uncover the spoofing evidence based on either photometric or dynamic abnormality.
We propose the Graph Guided Video Vision Transformer, which combines faces with facial landmarks for photometric and dynamic feature fusion.
- Score: 23.325272595629773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In videos containing spoofed faces, we may uncover the spoofing evidence based on either photometric or dynamic abnormality, or even a combination of both. Prevailing face anti-spoofing (FAS) approaches generally concentrate on the single-frame scenario; however, purely photometric-driven methods overlook the dynamic spoofing clues that may be exposed over time. This may lead FAS systems to reach incorrect judgments, especially in cases where a spoof is easily distinguishable in terms of dynamics but challenging to discern in terms of photometrics. To this end, we propose the Graph Guided Video Vision Transformer (G$^2$V$^2$former), which combines faces with facial landmarks for photometric and dynamic feature fusion. We factorize the attention into space and time, and fuse them via a spatiotemporal block. Specifically, we design a novel temporal attention called Kronecker temporal attention, which has a wider receptive field and is beneficial for capturing dynamic information. Moreover, we leverage the low-semantic motion of facial landmarks to guide the high-semantic change of facial expressions, based on the motivation that regions containing landmarks may reveal more dynamic clues. Extensive experiments on nine benchmark datasets demonstrate that our method achieves superior performance under various scenarios. The code will be released soon.
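The abstract's idea of factorizing attention into space and time can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it shows only the generic divided space-time attention pattern (spatial attention within each frame, then temporal attention across frames at each patch position). It does not implement Kronecker temporal attention, the landmark graph guidance, or the spatiotemporal fusion block, whose details are not given here. The function names, the token layout `(T, N, D)`, and the use of identity query/key/value projections are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis.
    # q, k, v have shape (..., n, d); leading axes are treated as batch.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def factorized_space_time_attention(x):
    """Hypothetical helper: x has shape (T, N, D) for T frames,
    N patch tokens per frame, and D channels.

    Assumption: identity projections stand in for the learned
    query/key/value weights of a real transformer layer.
    """
    # Spatial step: each frame attends across its own N patches.
    spatial = attention(x, x, x)                # (T, N, D)
    # Temporal step: each patch position attends across the T frames.
    per_patch = np.swapaxes(spatial, 0, 1)      # (N, T, D)
    temporal = attention(per_patch, per_patch, per_patch)
    return np.swapaxes(temporal, 0, 1)          # (T, N, D)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 9, 8))             # 4 frames, 9 patches, 8 channels
out = factorized_space_time_attention(tokens)
print(out.shape)                                # (4, 9, 8)
```

Factorizing this way means each step builds attention maps of size N x N or T x T rather than one (T·N) x (T·N) map over all tokens, which is the usual motivation for splitting space and time.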
Related papers
- SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting [11.978551396144532]
In this paper, we propose an efficient framework for facial expression spotting.
First, we propose a Sliding Window-based Multi-Resolution Optical flow (SW-MRO) feature, which calculates multi-resolution optical flow of the input sequence within compact sliding windows.
Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes facial spatio-temporal relationships of the SW-MRO features for accurate frame-level probability estimation.
Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions.
arXiv Detail & Related papers (2024-07-30T13:02:08Z)
- UniForensics: Face Forgery Detection via General Facial Representation [60.5421627990707]
High-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization.
We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z)
- Diffusion Priors for Dynamic View Synthesis from Monocular Videos [59.42406064983643]
Dynamic novel view synthesis aims to capture the temporal evolution of visual content within videos.
We first finetune a pretrained RGB-D diffusion model on the video frames using a customization technique.
We distill the knowledge from the finetuned model to 4D representations encompassing both dynamic and static Neural Radiance Fields.
arXiv Detail & Related papers (2024-01-10T23:26:41Z)
- From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos [88.08209394979178]
Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations.
We introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features.
arXiv Detail & Related papers (2023-12-09T03:16:09Z)
- Latent Spatiotemporal Adaptation for Generalized Face Forgery Video Detection [22.536129731902783]
We propose a Latent Spatiotemporal Adaptation (LAST) approach to facilitate generalized face forgery video detection.
We first model the spatiotemporal patterns of face videos by incorporating a lightweight CNN to extract local spatial features of each frame.
Then we learn the long-term spatiotemporal representations in the latent space of videos, which should contain more clues than in pixel space.
arXiv Detail & Related papers (2023-09-09T13:40:44Z)
- Masked Motion Encoding for Self-Supervised Video Representation Learning [84.24773072241945]
We present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues.
Motivated by the fact that humans are able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions.
Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details.
arXiv Detail & Related papers (2022-10-12T11:19:55Z)
- Face Detection in Extreme Conditions: A Machine-learning Approach [0.0]
Recent studies show that deep learning strategies can achieve spectacular performance in identifying different objects and patterns.
This paper proposes a deep cascaded multi-venture framework that exploits the inherent correlation among them to boost their performance.
In particular, my framework adopts a cascaded structure with three layers of carefully designed deep convolutional networks that predict face and landmark regions in a coarse-to-fine way.
arXiv Detail & Related papers (2022-01-17T05:23:22Z)
- FakeTransformer: Exposing Face Forgery From Spatial-Temporal Representation Modeled By Facial Pixel Variations [8.194624568473126]
Face forgery can attack any target, which poses a new threat to personal privacy and property security.
Inspired by the fact that the spatial coherence and temporal consistency of physiological signals are destroyed in the generated content, we attempt to find inconsistent patterns that can distinguish between real videos and synthetic videos.
arXiv Detail & Related papers (2021-11-15T08:44:52Z)
- Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video [76.19076002661157]
Non-Rigid Neural Radiance Fields (NR-NeRF) is a reconstruction and novel view synthesis approach for general non-rigid dynamic scenes.
We show that even a single consumer-grade camera is sufficient to synthesize sophisticated renderings of a dynamic scene from novel virtual camera views.
arXiv Detail & Related papers (2020-12-22T18:46:12Z)
- Deep Spatial Gradient and Temporal Depth Learning for Face Anti-spoofing [61.82466976737915]
Depth supervised learning has been proven as one of the most effective methods for face anti-spoofing.
We propose a new approach to detect presentation attacks from multiple frames based on two insights.
The proposed approach achieves state-of-the-art results on five benchmark datasets.
arXiv Detail & Related papers (2020-03-18T06:11:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.