ViTransPAD: Video Transformer using convolution and self-attention for
Face Presentation Attack Detection
- URL: http://arxiv.org/abs/2203.01562v1
- Date: Thu, 3 Mar 2022 08:23:20 GMT
- Title: ViTransPAD: Video Transformer using convolution and self-attention for
Face Presentation Attack Detection
- Authors: Zuheng Ming, Zitong Yu, Musab Al-Ghadi, Muriel Visani, Muhammad
Muzzamil Luqman, Jean-Christophe Burie
- Abstract summary: Face Presentation Attack Detection (PAD) is an important measure to prevent spoof attacks on face biometric systems.
Many works based on Convolutional Neural Networks (CNNs) for face PAD formulate the problem as an image-level binary classification task without considering the context.
We propose a Video-based Transformer for face PAD (ViTransPAD) with short/long-range spatio-temporal attention, which can not only focus on local details with short attention within a frame but also capture long-range dependencies over frames.
- Score: 15.70621878093133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Face Presentation Attack Detection (PAD) is an important measure to
prevent spoof attacks on face biometric systems. Many works based on
Convolutional Neural Networks (CNNs) for face PAD formulate the problem as an
image-level binary classification task without considering the context.
Alternatively, Vision Transformers (ViT), which use self-attention to attend to
the context of an image, have become mainstream in face PAD. Inspired by ViT,
we propose a Video-based Transformer for face PAD (ViTransPAD) with
short/long-range spatio-temporal attention, which can not only focus on local
details with short attention within a frame but also capture long-range
dependencies over frames. Instead of using single-scale coarse image patches
as in ViT, we propose a Multi-scale Multi-Head Self-Attention (MsMHSA)
architecture that assigns multi-scale patch partitions of the Q, K, V feature
maps to the heads of the transformer in a coarse-to-fine manner, enabling the
network to learn a fine-grained representation for pixel-level discrimination
in face PAD. Since pure transformers lack the inductive biases of convolutions,
we also introduce convolutions into ViTransPAD to integrate the desirable
properties of CNNs, using convolution patch embedding and convolution
projection. Extensive experiments show the effectiveness of the proposed
ViTransPAD, which achieves a preferable accuracy-computation balance and can
serve as a new backbone for face PAD.
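To make the two convolutional ingredients mentioned in the abstract more concrete, the following is a minimal PyTorch-style sketch of (a) convolution patch embedding and (b) convolutional projection of the Q, K, V maps inside a self-attention block. It is only an illustration under assumed settings (module names, embedding dimension, kernel sizes and head count are our choices, not the authors'), and it does not reproduce the multi-scale head-wise patch partitioning of MsMHSA or the short/long-range spatio-temporal attention over frames.

# Minimal sketch of convolution patch embedding and convolutional Q/K/V
# projection for self-attention. Dimensions, kernel sizes and head count
# are illustrative assumptions, not the ViTransPAD reference code.
import torch
import torch.nn as nn


class ConvPatchEmbed(nn.Module):
    """Embed a frame with a strided convolution instead of a plain
    linear projection of non-overlapping patches."""

    def __init__(self, in_chans=3, embed_dim=96, patch_size=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        return self.proj(x)                    # (B, D, H/ps, W/ps)


class ConvProjectionAttention(nn.Module):
    """Multi-head self-attention whose Q, K, V come from depthwise
    convolutions over the 2-D feature map (convolution projection)."""

    def __init__(self, dim=96, num_heads=4, kernel_size=3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        pad = kernel_size // 2
        self.q_proj = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.k_proj = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.v_proj = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, D, H, W)
        B, D, H, W = x.shape

        def tokens(t):                         # (B, D, H, W) -> (B, heads, N, d)
            t = t.flatten(2).transpose(1, 2)   # (B, N, D), with N = H*W
            return t.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q = tokens(self.q_proj(x))
        k = tokens(self.k_proj(x))
        v = tokens(self.v_proj(x))
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, H * W, D)
        return self.out(out)                   # (B, N, D) token features


# Toy usage on a batch of two 64x64 frames.
frames = torch.randn(2, 3, 64, 64)
feat = ConvPatchEmbed()(frames)                # (2, 96, 16, 16)
out = ConvProjectionAttention()(feat)          # (2, 256, 96)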
Related papers
- Attention Deficit is Ordered! Fooling Deformable Vision Transformers
with Collaborative Adversarial Patches [3.4673556247932225]
Deformable vision transformers significantly reduce the complexity of attention modeling.
Recent work has demonstrated adversarial attacks against conventional vision transformers.
We develop new collaborative attacks where a source patch manipulates attention to point to a target patch, which contains the adversarial noise to fool the model.
arXiv Detail & Related papers (2023-11-21T17:55:46Z)
- Optimizing Vision Transformers for Medical Image Segmentation and Few-Shot Domain Adaptation [11.690799827071606]
We propose Convolutional Swin-Unet (CS-Unet) transformer blocks and optimise their settings with respect to patch embedding, projection, the feed-forward network, upsampling and skip connections.
CS-Unet can be trained from scratch and inherits the superiority of convolutions in each feature process phase.
Experiments show that CS-Unet without pre-training surpasses other state-of-the-art counterparts by large margins on two medical CT and MRI datasets with fewer parameters.
arXiv Detail & Related papers (2022-10-14T19:18:52Z)
- DeViT: Deformed Vision Transformers in Video Inpainting [59.73019717323264]
First, we extend previous Transformers with patch alignment by introducing Deformed Patch-based Homography (DePtH).
Second, we introduce Mask Pruning-based Patch Attention (MPPA) to improve patch-wised feature matching.
Third, we introduce a Spatial-Temporal weighting Adaptor (STA) module to obtain accurate attention to spatial-temporal tokens.
arXiv Detail & Related papers (2022-09-28T08:57:14Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
Less attention vIsion Transformer builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)