Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face
Anti-Spoofing
- URL: http://arxiv.org/abs/2302.05744v1
- Date: Sat, 11 Feb 2023 17:02:34 GMT
- Title: Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face
Anti-Spoofing
- Authors: Zitong Yu, Rizhao Cai, Yawen Cui, Xin Liu, Yongjian Hu, Alex Kot
- Abstract summary: We investigate three key factors (i.e., inputs, pre-training, and finetuning) in ViT for multimodal FAS with RGB, Infrared (IR), and Depth.
We propose the modality-asymmetric masked autoencoder (M$^{2}$A$^{2}$E) for multimodal FAS self-supervised pre-training without costly annotated labels.
- Score: 19.142582966452935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, vision transformer (ViT) based multimodal learning methods have
been proposed to improve the robustness of face anti-spoofing (FAS) systems.
However, no prior work has explored the fundamental properties (e.g.,
modality-aware inputs, suitable multimodal pre-training, and efficient
finetuning) of vanilla ViT for multimodal FAS. In this paper, we
investigate three key factors (i.e., inputs, pre-training, and finetuning) in
ViT for multimodal FAS with RGB, Infrared (IR), and Depth. First, in terms of
the ViT inputs, we find that leveraging local feature descriptors benefits the
ViT on the IR modality but not on the RGB or Depth modalities. Second,
observing the inefficiency of directly finetuning the whole or a partial ViT,
we design an adaptive multimodal adapter (AMA), which can efficiently
aggregate local multimodal features while freezing the majority of ViT
parameters. Finally, in
consideration of the task (FAS vs. generic object classification) and modality
(multimodal vs. unimodal) gaps, ImageNet pre-trained models might be
sub-optimal for the multimodal FAS task. To bridge these gaps, we propose the
modality-asymmetric masked autoencoder (M$^{2}$A$^{2}$E) for multimodal FAS
self-supervised pre-training without costly annotated labels. Compared with the
previous modality-symmetric autoencoder, the proposed M$^{2}$A$^{2}$E learns
more intrinsic task-aware representations and is compatible with
modality-agnostic (e.g., unimodal, bimodal, and trimodal) downstream settings.
Extensive experiments with both unimodal (RGB, Depth, IR) and multimodal
(RGB+Depth, RGB+IR, Depth+IR, RGB+Depth+IR) settings conducted on multimodal
FAS benchmarks demonstrate the superior performance of the proposed methods. We
hope these findings and solutions will facilitate future research on
ViT-based multimodal FAS.
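
The abstract describes the adaptive multimodal adapter (AMA) only at a high level: a small trainable module that aggregates local multimodal features while most ViT parameters stay frozen. The PyTorch sketch below illustrates that general "frozen backbone + trainable adapter" pattern; the MultimodalAdapter bottleneck, the 1x1 cross-modality convolution, and all module names are illustrative assumptions rather than the paper's actual AMA design.

```python
# Minimal sketch, assuming a generic adapter design: the exact AMA architecture
# is not given in the abstract, so module names and shapes here are hypothetical.
import torch
import torch.nn as nn


class MultimodalAdapter(nn.Module):
    """Bottleneck adapter that mixes token features across RGB/IR/Depth streams."""

    def __init__(self, dim: int, num_modalities: int = 3, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.mix = nn.Conv1d(num_modalities, num_modalities, kernel_size=1)  # cross-modality mixing (assumption)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, M, N, D)
        b, m, n, _ = tokens.shape
        h = self.act(self.down(tokens))                        # (B, M, N, bottleneck)
        h = h.permute(0, 2, 1, 3).reshape(b * n, m, -1)        # fold patches into the batch axis
        h = self.mix(h)                                        # exchange information across modalities
        h = h.reshape(b, n, m, -1).permute(0, 2, 1, 3)
        return tokens + self.up(h)                             # residual keeps the frozen features intact


class AdapterViT(nn.Module):
    """Tiny stand-in for a ViT: frozen transformer blocks with trainable adapters interleaved."""

    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        )
        self.adapters = nn.ModuleList(MultimodalAdapter(dim) for _ in range(depth))
        self.head = nn.Linear(dim, 2)                          # live vs. spoof
        for p in self.blocks.parameters():                     # freeze the backbone
            p.requires_grad = False

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # tokens: (B, M, N, D)
        b, m, n, d = tokens.shape
        for blk, ada in zip(self.blocks, self.adapters):
            tokens = blk(tokens.reshape(b * m, n, d)).reshape(b, m, n, d)
            tokens = ada(tokens)
        return self.head(tokens.mean(dim=(1, 2)))


model = AdapterViT()
logits = model(torch.randn(2, 3, 16, 256))                     # 2 samples, 3 modalities, 16 patch tokens
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(logits.shape, f"trainable params: {trainable}/{total}")
```

Only the adapters and the classification head receive gradients, which is what makes this finetuning style parameter-efficient compared with updating the whole ViT.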
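
The abstract likewise names the modality-asymmetric masked autoencoder (M$^{2}$A$^{2}$E) without detailing its masking scheme. One plausible reading of "modality-asymmetric" is that the encoder sees masked patches from a single randomly sampled modality while reconstruction targets span all modalities; the sketch below implements that masking step under this assumption, with hypothetical function and variable names.

```python
# Minimal sketch under one reading of "modality-asymmetric": mask patches of a
# single sampled modality and reconstruct every modality; names are hypothetical.
import torch


def asymmetric_mask(batch: dict, mask_ratio: float = 0.75):
    """`batch` maps a modality name ('rgb', 'ir', 'depth') to patch tokens (B, N, D)."""
    modalities = list(batch.keys())
    src = modalities[torch.randint(len(modalities), (1,)).item()]  # sampled encoder modality
    tokens = batch[src]
    b, n, d = tokens.shape
    keep = int(n * (1.0 - mask_ratio))
    idx = torch.rand(b, n).argsort(dim=1)[:, :keep]                # random visible-patch subset
    visible = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    targets = {m: batch[m] for m in modalities}                    # decoder reconstructs all modalities
    return src, visible, idx, targets


# Toy usage: 4 samples, 196 patches per modality, 256-dim patch tokens.
batch = {m: torch.randn(4, 196, 256) for m in ("rgb", "ir", "depth")}
src, visible, idx, targets = asymmetric_mask(batch)
print(src, visible.shape)                                          # e.g. "ir", torch.Size([4, 49, 256])
```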
Related papers
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z) - Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID.
Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing [88.6654909354382]
We present a pure transformer-based framework, dubbed the Flexible Modal Vision Transformer (FM-ViT) for face anti-spoofing.
FM-ViT can flexibly target any single-modal (i.e., RGB) attack scenario with the help of available multi-modal data.
Experiments demonstrate that a single model trained with FM-ViT can not only flexibly evaluate samples of different modalities, but also outperforms existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-05-05T04:28:48Z) - Visual Prompt Multi-Modal Tracking [71.53972967568251]
Visual Prompt multi-modal Tracking (ViPT) learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks.
ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking.
arXiv Detail & Related papers (2023-03-20T01:51:07Z) - Flexible-Modal Face Anti-Spoofing: A Benchmark [66.18359076810549]
Face anti-spoofing (FAS) plays a vital role in securing face recognition systems from presentation attacks.
We establish the first flexible-modal FAS benchmark with the principle of 'train one for all'.
We also investigate prevalent deep models and feature fusion strategies for flexible-modal FAS.
arXiv Detail & Related papers (2022-02-16T16:55:39Z) - LMR-CBT: Learning Modality-fused Representations with CB-Transformer for
Multimodal Emotion Recognition from Unaligned Multimodal Sequences [5.570499497432848]
We propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition.
We conduct word-aligned and unaligned experiments on three challenging datasets.
arXiv Detail & Related papers (2021-12-03T03:43:18Z)