FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing
- URL: http://arxiv.org/abs/2305.03277v1
- Date: Fri, 5 May 2023 04:28:48 GMT
- Title: FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing
- Authors: Ajian Liu, Zichang Tan, Zitong Yu, Chenxu Zhao, Jun Wan, Yanyan Liang,
Zhen Lei, Du Zhang, Stan Z. Li, Guodong Guo
- Abstract summary: We present a pure transformer-based framework, dubbed the Flexible Modal Vision Transformer (FM-ViT) for face anti-spoofing.
FM-ViT can flexibly target any single-modal (i.e., RGB) attack scenarios with the help of available multi-modal data.
Experiments demonstrate that the single model trained based on FM-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin.
- Score: 88.6654909354382
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The availability of handy multi-modal (i.e., RGB-D) sensors has brought about
a surge of face anti-spoofing research. However, the current multi-modal face
presentation attack detection (PAD) has two defects: (1) The framework based on
multi-modal fusion requires providing modalities consistent with the training
input, which seriously limits the deployment scenario. (2) The performance of
ConvNet-based model on high fidelity datasets is increasingly limited. In this
work, we present a pure transformer-based framework, dubbed the Flexible Modal
Vision Transformer (FM-ViT), for face anti-spoofing to flexibly target any
single-modal (i.e., RGB) attack scenarios with the help of available
multi-modal data. Specifically, FM-ViT retains a specific branch for each
modality to capture different modal information and introduces the Cross-Modal
Transformer Block (CMTB), which consists of two cascaded attentions named
Multi-headed Mutual-Attention (MMA) and Fusion-Attention (MFA) to guide each
modal branch to mine potential features from informative patch tokens, and to
learn modality-agnostic liveness features by enriching the modal information of
own CLS token, respectively. Experiments demonstrate that the single model
trained based on FM-ViT can not only flexibly evaluate different modal samples,
but also outperforms existing single-modal frameworks by a large margin, and
approaches the multi-modal frameworks introduced with smaller FLOPs and model
parameters.
Related papers
- Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance [15.435695491233982]
We propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the Segment Anything Model (SAM) for multi-modal salient object detection (SOD)
We develop underlineSAM with seunderlinemantic funderlineeature fuunderlinesion guidancunderlinee (Sammese)
In the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. Specifically, in the mask decoder, a semantic-geometric
arXiv Detail & Related papers (2024-08-27T13:47:31Z) - Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities [8.517830626176641]
Any2Seg is a novel framework that can achieve robust segmentation from any combination of modalities in any visual conditions.
Experiments on two benchmarks with four modalities demonstrate that Any2Seg achieves the state-of-the-art under the multi-modal setting.
arXiv Detail & Related papers (2024-07-16T03:34:38Z) - All in One Framework for Multimodal Re-identification in the Wild [58.380708329455466]
multimodal learning paradigm for ReID introduced, referred to as All-in-One (AIO)
AIO harnesses a frozen pre-trained big model as an encoder, enabling effective multimodal retrieval without additional fine-tuning.
Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts.
arXiv Detail & Related papers (2024-05-08T01:04:36Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Visual Prompt Flexible-Modal Face Anti-Spoofing [23.58674017653937]
multimodal face data collected from the real world is often imperfect due to missing modalities from various imaging sensors.
We propose flexible-modal FAS, which learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to downstream flexible-modal FAS task.
experiments conducted on two multimodal FAS benchmark datasets demonstrate the effectiveness of our VP-FAS framework.
arXiv Detail & Related papers (2023-07-26T05:06:41Z) - MA-ViT: Modality-Agnostic Vision Transformers for Face Anti-Spoofing [3.3031006227198003]
We present Modality-Agnostic Vision Transformer (MA-ViT), which aims to improve the performance of arbitrary modal attacks with the help of multi-modal data.
Specifically, MA-ViT adopts the early fusion to aggregate all the available training modalities data and enables flexible testing of any given modal samples.
Experiments demonstrate that the single model trained on MA-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-04-15T13:03:44Z) - Flexible-Modal Face Anti-Spoofing: A Benchmark [66.18359076810549]
Face anti-spoofing (FAS) plays a vital role in securing face recognition systems from presentation attacks.
We establish the first flexible-modal FAS benchmark with the principle train one for all'
We also investigate prevalent deep models and feature fusion strategies for flexible-modal FAS.
arXiv Detail & Related papers (2022-02-16T16:55:39Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
Model takes two bimodal pairs as input due to known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.