Fugu-MT Paper Translation (Abstract): Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face Anti-Spoofing

Paper summary: Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face Anti-Spoofing

arxiv url: http://arxiv.org/abs/2302.05744v1
Date: Sat, 11 Feb 2023 17:02:34 GMT
Status: Translation complete
Last updated in system: 2023-02-14 18:53:04.050848
Title: Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face Anti-Spoofing
Title (reference translation): Rethinking Vision Transformers and Masked Autoencoders in Multimodal Face Anti-Spoofing
Authors: Zitong Yu, Rizhao Cai, Yawen Cui, Xin Liu, Yongjian Hu, Alex Kot
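Abstract summary: The paper investigates three key factors (inputs, pre-training, and finetuning) in ViT for multimodal FAS with RGB, infrared (IR), and depth. It proposes the modality-asymmetric masked autoencoder (M$^{2}$A$^{2}$E) for self-supervised pre-training in multimodal FAS.
Reference score (independently calculated attention score): 19.142582966452935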
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, vision transformer (ViT) based multimodal learning methods have been proposed to improve the robustness of face anti-spoofing (FAS) systems. However, no work has yet explored the fundamental properties (e.g., modality-aware inputs, suitable multimodal pre-training, and efficient finetuning) of vanilla ViT for multimodal FAS. In this paper, we investigate three key factors (i.e., inputs, pre-training, and finetuning) in ViT for multimodal FAS with RGB, Infrared (IR), and Depth. First, in terms of the ViT inputs, we find that leveraging local feature descriptors benefits the ViT on the IR modality but not on the RGB or Depth modalities. Second, observing that directly finetuning the whole or partial ViT is inefficient, we design an adaptive multimodal adapter (AMA), which can efficiently aggregate local multimodal features while freezing the majority of ViT parameters. Finally, considering the task (FAS vs. generic object classification) and modality (multimodal vs. unimodal) gaps, ImageNet pre-trained models might be sub-optimal for the multimodal FAS task. To bridge these gaps, we propose the modality-asymmetric masked autoencoder (M$^{2}$A$^{2}$E) for multimodal FAS self-supervised pre-training without costly annotated labels. Compared with the previous modality-symmetric autoencoder, the proposed M$^{2}$A$^{2}$E is able to learn more intrinsic task-aware representations and is compatible with modality-agnostic (e.g., unimodal, bimodal, and trimodal) downstream settings. Extensive experiments with both unimodal (RGB, Depth, IR) and multimodal (RGB+Depth, RGB+IR, Depth+IR, RGB+Depth+IR) settings conducted on multimodal FAS benchmarks demonstrate the superior performance of the proposed methods. We hope these findings and solutions can facilitate future research on ViT-based multimodal FAS.
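Abstract (reference translation): In recent years, multimodal learning methods based on the vision transformer (ViT) have been proposed to improve the robustness of face anti-spoofing (FAS) systems. However, no work has yet explored the fundamental properties of vanilla ViT (e.g., modality-aware inputs, suitable multimodal pre-training, and efficient finetuning) for multimodal FAS. This paper investigates three key factors in ViT (inputs, pre-training, and finetuning) for multimodal FAS with RGB, infrared (IR), and depth. First, regarding the ViT inputs, leveraging local feature descriptors is found to benefit the ViT on the IR modality but not on the RGB or Depth modalities. Second, an adaptive multimodal adapter (AMA) is designed that efficiently aggregates local multimodal features while freezing the majority of the ViT parameters. Finally, considering the gaps in task (FAS vs. generic object classification) and modality (multimodal vs. unimodal), ImageNet pre-trained models may be sub-optimal for the multimodal FAS task. To bridge these gaps, the paper proposes the modality-asymmetric masked autoencoder (M$^{2}$A$^{2}$E) for self-supervised pre-training in multimodal FAS. Compared with the previous modality-symmetric autoencoder, the proposed M$^{2}$A$^{2}$E can learn more intrinsic task-aware representations and is compatible with modality-agnostic (e.g., unimodal, bimodal, and trimodal) downstream settings. Experiments in both unimodal (RGB, Depth, IR) and multimodal (RGB+Depth, RGB+IR, Depth+IR, RGB+Depth+IR) settings on multimodal FAS benchmarks show the superior performance of the proposed methods. The authors hope these findings and solutions will facilitate future research on ViT-based multimodal FAS.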
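To make the idea of modality-asymmetric masking more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "asymmetric" means the encoder sees visible patch tokens from one randomly chosen modality while the reconstruction targets cover all modalities; the abstract does not spell this out. The function name `asymmetric_mask`, the 0.75 mask ratio, and the 196-patch grid are hypothetical choices made only for illustration.

```python
import random

# Modalities named in the abstract.
MODALITIES = ("rgb", "ir", "depth")


def asymmetric_mask(num_patches, mask_ratio=0.75, rng=None):
    """Choose one input modality and split patch indices into visible/masked.

    Assumption (not stated in the abstract): the encoder only receives the
    visible patches of the chosen modality, and the decoder reconstructs
    the masked patches of every modality in MODALITIES.
    """
    rng = rng or random.Random()
    input_modality = rng.choice(MODALITIES)

    # Shuffle patch indices and mask the first `num_masked` of them.
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked_idx = sorted(order[:num_masked])
    visible_idx = sorted(order[num_masked:])

    return input_modality, visible_idx, masked_idx, MODALITIES


# Hypothetical setting: 14 x 14 = 196 patches per image.
modality, visible, masked, targets = asymmetric_mask(196, 0.75, random.Random(0))
print(modality, len(visible), len(masked), targets)
```

Under this assumption, a symmetric variant would instead feed visible tokens from every modality to the encoder; the sketch differs only in restricting the input to a single modality per sample.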

Related papers