Unimodal Face Classification with Multimodal Training
- URL: http://arxiv.org/abs/2112.04182v1
- Date: Wed, 8 Dec 2021 09:12:47 GMT
- Title: Unimodal Face Classification with Multimodal Training
- Authors: Wenbin Teng and Chongyang Bai
- Abstract summary: We propose a Multimodal Training Unimodal Test (MTUT) framework for robust face classification.
Our framework exploits the cross-modality relationship during training and applies it to complement the imperfect single-modality input during testing.
We experimentally show that our MTUT framework consistently outperforms ten baselines on 2D and 3D settings of both datasets.
- Score: 1.9580473532948401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Face recognition is a crucial task in various multimedia applications such as
security check, credential access and motion sensing games. However, the task
is challenging when an input face is noisy (e.g. poor-condition RGB image) or
lacks certain information (e.g. 3D face without color). In this work, we
propose a Multimodal Training Unimodal Test (MTUT) framework for robust face
classification, which exploits the cross-modality relationship during training
and applies it to complement the imperfect single-modality input during
testing. Technically, during training, the framework (1) builds both
intra-modality and cross-modality autoencoders with the aid of facial
attributes to learn latent embeddings as multimodal descriptors, (2) proposes a
novel multimodal embedding divergence loss to align the heterogeneous features
from different modalities, which also adaptively prevents a useless modality
(if any) from confusing the model. This way, the learned autoencoders can
generate robust embeddings for single-modality face classification at the test
stage. We evaluate our framework on two face classification datasets and two
kinds of testing input: (1) poor-condition image and (2) point cloud or 3D face
mesh, when both 2D and 3D modalities are available for training. We
experimentally show that our MTUT framework consistently outperforms ten
baselines on 2D and 3D settings of both datasets.
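A minimal PyTorch sketch of the training setup the abstract describes (intra-modality and cross-modality autoencoders plus an embedding-alignment term), shown only to make the idea concrete: the module sizes, the MSE-based divergence, and the loss weighting below are illustrative assumptions rather than the authors' implementation, and the adaptive down-weighting of a useless modality is omitted.

```python
# Illustrative sketch of multimodal training / unimodal testing (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAutoencoder(nn.Module):
    """Encoder/decoder pair for one modality (e.g. 2D image features or 3D features)."""
    def __init__(self, in_dim, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

class MTUTSketch(nn.Module):
    def __init__(self, dim_2d, dim_3d, latent_dim=128, num_classes=10):
        super().__init__()
        self.ae_2d = ModalityAutoencoder(dim_2d, latent_dim)
        self.ae_3d = ModalityAutoencoder(dim_3d, latent_dim)
        self.classifier = nn.Linear(latent_dim, num_classes)  # shared head on the latent embedding

    def training_losses(self, x_2d, x_3d, labels, align_weight=1.0):
        z_2d, rec_2d = self.ae_2d(x_2d)
        z_3d, rec_3d = self.ae_3d(x_3d)
        # Intra-modality reconstruction: each autoencoder reconstructs its own input.
        loss_intra = F.mse_loss(rec_2d, x_2d) + F.mse_loss(rec_3d, x_3d)
        # Cross-modality reconstruction: decode one modality from the other's embedding.
        loss_cross = F.mse_loss(self.ae_2d.decoder(z_3d), x_2d) + F.mse_loss(self.ae_3d.decoder(z_2d), x_3d)
        # Embedding-alignment ("divergence") term pulling the two latent descriptors together.
        loss_align = F.mse_loss(z_2d, z_3d)
        # Classification on both embeddings so either modality can be used alone at test time.
        loss_cls = F.cross_entropy(self.classifier(z_2d), labels) + F.cross_entropy(self.classifier(z_3d), labels)
        return loss_intra + loss_cross + align_weight * loss_align + loss_cls

    @torch.no_grad()
    def predict_unimodal(self, x, modality="2d"):
        # Test stage: only a single (possibly imperfect) modality is available.
        z = (self.ae_2d if modality == "2d" else self.ae_3d).encoder(x)
        return self.classifier(z).argmax(dim=-1)
```

At test time only the available modality's encoder and the shared classifier are used, which is what makes the framework unimodal at inference despite the multimodal training losses.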
Related papers
- Exploring the Untouched Sweeps for Conflict-Aware 3D Segmentation Pretraining [41.145598142457686]
LiDAR-camera 3D representation pretraining has shown significant promise for 3D perception tasks and related applications.
We propose a novel Vision-Foundation-Model-driven sample exploring module to meticulously select LiDAR-Image pairs from unexplored frames.
Our method consistently outperforms existing state-of-the-art pretraining frameworks across three major public autonomous driving datasets.
arXiv Detail & Related papers (2024-07-10T08:46:29Z)
- Cross-BERT for Point Cloud Pretraining [61.762046503448936]
We propose a new cross-modal BERT-style self-supervised learning paradigm, called Cross-BERT.
To facilitate pretraining for irregular and sparse point clouds, we design two self-supervised tasks to boost cross-modal interaction.
Our work highlights the effectiveness of leveraging cross-modal 2D knowledge to strengthen 3D point cloud representation and the transferable capability of BERT across modalities.
arXiv Detail & Related papers (2023-12-08T08:18:12Z) - X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning [109.9413329636322]
This paper introduces an efficient framework that integrates multiple modalities (images, 3D, audio, and video) with a frozen Large Language Model (LLM).
Our approach explores two distinct projection mechanisms: Q-Formers and Linear Projections (LPs).
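As a rough illustration of the simpler of the two projection mechanisms mentioned in the X-InstructBLIP summary above, the sketch below maps frozen modality-encoder features into a frozen LLM's token-embedding space with a single trained linear layer; the dimensions and token count are hypothetical, not the paper's settings.

```python
import torch
import torch.nn as nn

class LinearProjectionAdapter(nn.Module):
    """Maps frozen modality-encoder features into the token-embedding space of a frozen LLM."""
    def __init__(self, feat_dim=1024, llm_dim=4096, num_tokens=32):
        super().__init__()
        # Only this projection is trained; the modality encoder and the LLM stay frozen.
        self.proj = nn.Linear(feat_dim, llm_dim)
        self.num_tokens = num_tokens

    def forward(self, modality_feats):
        # modality_feats: (batch, seq, feat_dim) from an image / audio / 3D encoder.
        tokens = self.proj(modality_feats[:, : self.num_tokens])  # (batch, num_tokens, llm_dim)
        return tokens  # prepended to the LLM's text embeddings as "soft" modality tokens
```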
arXiv Detail & Related papers (2023-11-30T18:43:51Z)
- DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation [10.73030153404956]
We propose a cross-modal dual-learning framework, termed DualTalker, to improve data usage efficiency.
The framework is trained jointly with the primary task (audio-driven facial animation) and its dual task (lip reading) and shares common audio/motion encoder components.
Our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
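To make the weight-sharing idea in the DualTalker summary concrete, here is a schematic (not the authors' architecture) in which the primary audio-to-motion task and the dual lip-reading task reuse the same audio and motion encoders; the layer types and sizes are assumptions.

```python
import torch
import torch.nn as nn

class DualTaskSketch(nn.Module):
    """Audio-driven animation (primary) and lip reading (dual) sharing the audio/motion encoders."""
    def __init__(self, audio_dim=80, motion_dim=64, hidden=256, vocab=40):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)    # shared by both tasks
        self.motion_enc = nn.GRU(motion_dim, hidden, batch_first=True)  # shared by both tasks
        self.motion_dec = nn.Linear(hidden, motion_dim)  # primary task: audio -> facial motion
        self.text_dec = nn.Linear(hidden, vocab)         # dual task: motion -> viseme/character logits

    def forward(self, audio, motion):
        a_feat, _ = self.audio_enc(audio)    # (B, T, hidden)
        m_feat, _ = self.motion_enc(motion)  # (B, T, hidden)
        pred_motion = self.motion_dec(a_feat)  # primary-task output
        pred_text = self.text_dec(m_feat)      # dual-task output
        return pred_motion, pred_text
```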
arXiv Detail & Related papers (2023-11-08T15:39:56Z)
- M$^{3}$3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding [5.989397492717352]
We present M$^{3}$3D ($\underline{M}$ulti-$\underline{M}$odal $\underline{M}$asked $\underline{3D}$), built on multi-modal masked autoencoders.
We integrate two major self-supervised learning frameworks: Masked Image Modeling (MIM) and contrastive learning.
Experiments show that M$^{3}$3D outperforms the existing state-of-the-art approaches on ScanNet, NYUv2, UCF-101 and OR-AR.
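A condensed sketch of how a masked-reconstruction (MIM) term and an InfoNCE-style contrastive term can be combined into one objective, as the summary above indicates; the loss weighting, temperature, and feature shapes are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def combined_ssl_loss(pred_patches, target_patches, mask, emb_a, emb_b, temperature=0.07, alpha=1.0):
    """Masked-patch reconstruction (MIM) plus a cross-modal InfoNCE contrastive term."""
    # MIM term: average reconstruction error over masked patches only (mask is 1 where masked).
    per_patch = F.mse_loss(pred_patches, target_patches, reduction="none").mean(-1)
    mim = (per_patch * mask).sum() / mask.sum().clamp(min=1)
    # Contrastive term: matched embeddings (e.g. RGB / depth of the same sample) are positives.
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    nce = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    return mim + alpha * nce
```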
arXiv Detail & Related papers (2023-09-26T23:52:09Z)
- FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing [88.6654909354382]
We present a pure transformer-based framework, dubbed the Flexible Modal Vision Transformer (FM-ViT) for face anti-spoofing.
FM-ViT can flexibly target any single-modal (i.e., RGB) attack scenario with the help of available multi-modal data.
Experiments demonstrate that a single model trained with FM-ViT can not only flexibly handle samples from different modalities, but also outperforms existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-05-05T04:28:48Z)
- PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection [26.03582038710992]
Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities.
In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world.
We propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects.
arXiv Detail & Related papers (2023-03-14T17:58:03Z)
- Recurrent Multi-view Alignment Network for Unsupervised Surface Registration [79.72086524370819]
Learning non-rigid registration in an end-to-end manner is challenging due to the inherent high degrees of freedom and the lack of labeled training data.
We propose to represent the non-rigid transformation with a point-wise combination of several rigid transformations.
We also introduce a differentiable loss function that measures the 3D shape similarity on the projected multi-view 2D depth images.
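Below is a small sketch of the core representation described above: a non-rigid deformation expressed as a per-point weighted combination of a few rigid transforms. The blending uses plain softmax weights and matrix-form rotations, which is a simplification of the paper's formulation, and the multi-view depth-projection loss is not shown.

```python
import torch

def blend_rigid_transforms(points, rotations, translations, weights):
    """
    points:       (N, 3) source points
    rotations:    (K, 3, 3) rigid rotation matrices
    translations: (K, 3) rigid translations
    weights:      (N, K) per-point blending weights (rows sum to 1)
    Returns deformed points as a point-wise convex combination of K rigid transforms.
    """
    # Apply every rigid transform to every point: (K, N, 3).
    transformed = torch.einsum("kij,nj->kni", rotations, points) + translations[:, None, :]
    # Blend per point with the soft weights: (N, 3).
    return torch.einsum("nk,kni->ni", weights, transformed)

# Example: 100 points blended over 8 (here identity) rigid transforms with softmax weights.
pts = torch.randn(100, 3)
R = torch.eye(3).repeat(8, 1, 1)
t = torch.zeros(8, 3)
w = torch.softmax(torch.randn(100, 8), dim=-1)
deformed = blend_rigid_transforms(pts, R, t, w)
```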
arXiv Detail & Related papers (2020-11-24T14:22:42Z)
- Unsupervised Cross-Modal Alignment for Multi-Person 3D Pose Estimation [52.94078950641959]
We present a deployment friendly, fast bottom-up framework for multi-person 3D human pose estimation.
We adopt a novel neural representation of multi-person 3D pose which unifies the position of person instances with their corresponding 3D pose representation.
We propose a practical deployment paradigm where paired 2D or 3D pose annotations are unavailable.
arXiv Detail & Related papers (2020-08-04T07:54:25Z)
- Symbiotic Adversarial Learning for Attribute-based Person Search [86.7506832053208]
We present a symbiotic adversarial learning framework, called SAL. Two GANs sit at the base of the framework in a symbiotic learning scheme.
Specifically, two different types of generative adversarial networks learn collaboratively throughout the training process.
arXiv Detail & Related papers (2020-07-19T07:24:45Z)