MIMIC: Mask Image Pre-training with Mix Contrastive Fine-tuning for
Facial Expression Recognition
- URL: http://arxiv.org/abs/2401.07245v1
- Date: Sun, 14 Jan 2024 10:30:32 GMT
- Title: MIMIC: Mask Image Pre-training with Mix Contrastive Fine-tuning for
Facial Expression Recognition
- Authors: Fan Zhang, Xiaobao Guo, Xiaojiang Peng, Alex Kot
- Abstract summary: We introduce a novel FER training paradigm named Mask Image pre-training with MIx Contrastive fine-tuning (MIMIC)
In the initial phase, we pre-train the ViT via masked image reconstruction on general images.
In the fine-tuning stage, we introduce a mix-supervised contrastive learning process, which enhances the model with a more extensive range of positive samples.
- Score: 11.820043444385432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cutting-edge research in facial expression recognition (FER) currently favors
convolutional neural network (CNN) backbones that are pre-trained in a supervised
manner on face recognition datasets for feature extraction. However, because face
recognition datasets are vast and facial labels are costly to collect, this
pre-training paradigm incurs significant expense. To this end, we propose to
pre-train vision Transformers (ViTs) through a self-supervised approach on a
mid-scale general image dataset. Moreover, the domain gap between general image
datasets and FER datasets is more pronounced than that between face datasets and
FER datasets, so we propose a contrastive fine-tuning approach to effectively
mitigate this domain disparity.
Specifically, we introduce a novel FER training paradigm named Mask Image
pre-training with MIx Contrastive fine-tuning (MIMIC). In the initial phase, we
pre-train the ViT via masked image reconstruction on general images.
Subsequently, in the fine-tuning stage, we introduce a mix-supervised
contrastive learning process, which enhances the model with a more extensive
range of positive samples via a mixing strategy. Through extensive experiments
conducted on three benchmark datasets, we demonstrate that our MIMIC
outperforms the previous training paradigm, showing its capability to learn
better representations. Remarkably, the results indicate that the vanilla ViT
can achieve impressive performance without the need for intricate,
auxiliary-designed modules. Moreover, as the model size scales up, MIMIC shows
no performance saturation and surpasses current state-of-the-art methods.
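
The abstract outlines a two-stage recipe: MAE-style masked image reconstruction for pre-training, then mix-supervised contrastive fine-tuning in which mixed images enlarge the pool of positive pairs. Below is a minimal PyTorch sketch of what such a pipeline could look like; the function names, the 75% mask ratio, the Beta-distributed mixing coefficient, and the soft positive weighting are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the two MIMIC stages described in the abstract.
# The mixing rule, loss form, and hyper-parameters are assumptions, not the
# authors' exact formulation.


def mask_patches(patch_tokens, mask_ratio=0.75):
    """Stage 1 (masked image pre-training): randomly hide most patch tokens;
    a ViT encoder/decoder would then be trained to reconstruct them."""
    batch, num_patches, dim = patch_tokens.shape
    num_keep = int(num_patches * (1.0 - mask_ratio))
    keep_idx = torch.rand(batch, num_patches,
                          device=patch_tokens.device).argsort(dim=1)[:, :num_keep]
    visible = torch.gather(patch_tokens, 1,
                           keep_idx.unsqueeze(-1).expand(-1, -1, dim))
    return visible, keep_idx


def mixup(images, labels, alpha=0.2):
    """Assumed mixup rule that produces the extra positive samples."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    return mixed, labels, labels[perm], lam


def mix_supervised_contrastive_loss(feats, labels_a, labels_b, lam, temperature=0.1):
    """Stage 2 (mix contrastive fine-tuning): a mixed image is treated as a
    positive of both source classes, weighted by the mixing coefficient."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t() / temperature
    sim.fill_diagonal_(-1e9)                           # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_a = (labels_a.unsqueeze(1) == labels_a.unsqueeze(0)).float()
    pos_b = (labels_b.unsqueeze(1) == labels_a.unsqueeze(0)).float()
    weights = lam * pos_a + (1.0 - lam) * pos_b        # soft positive mask
    weights.fill_diagonal_(0.0)
    return -(weights * log_prob).sum(1).div(weights.sum(1).clamp(min=1.0)).mean()
```

In this sketch, `feats` would be the ViT embeddings of the mixed images, and every sample sharing either source label counts as a (soft) positive; the actual MIMIC objective and hyper-parameters may differ.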
Related papers
- Deep Domain Adaptation: A Sim2Real Neural Approach for Improving Eye-Tracking Systems [80.62854148838359]
Eye image segmentation is a critical step in eye tracking that has great influence over the final gaze estimate.
We use dimensionality-reduction techniques to measure the overlap between the target eye images and synthetic training data.
Our methods result in robust, improved performance when tackling the discrepancy between simulation and real-world data samples.
arXiv Detail & Related papers (2024-03-23T22:32:06Z) - Fiducial Focus Augmentation for Facial Landmark Detection [4.433764381081446]
We propose a novel image augmentation technique to enhance the model's understanding of facial structures.
We employ a Siamese architecture-based training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss.
Our approach outperforms multiple state-of-the-art approaches across various benchmark datasets.
arXiv Detail & Related papers (2024-02-23T01:34:00Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Learning to Mask and Permute Visual Tokens for Vision Transformer
Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z) - FreMIM: Fourier Transform Meets Masked Image Modeling for Medical Image
Segmentation [37.465246717967595]
We present a new MIM-based framework named FreMIM for self-supervised pre-training to better accomplish medical image segmentation tasks.
Our FreMIM could consistently bring considerable improvements to model performance.
arXiv Detail & Related papers (2023-04-21T10:23:34Z) - Cluster-level pseudo-labelling for source-free cross-domain facial
expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER)
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z) - Self-supervised Contrastive Learning of Multi-view Facial Expressions [9.949781365631557]
Facial expression recognition (FER) has emerged as an important component of human-computer interaction systems.
We propose Contrastive Learning of Multi-view facial Expressions (CL-MEx) to exploit facial images captured simultaneously from different angles towards FER.
arXiv Detail & Related papers (2021-08-15T11:23:34Z) - Mean Embeddings with Test-Time Data Augmentation for Ensembling of
Representations [8.336315962271396]
We look at the ensembling of representations and propose mean embeddings with test-time augmentation (MeTTA)
MeTTA significantly boosts the quality of linear evaluation on ImageNet for both supervised and self-supervised models.
We believe that spreading the success of ensembles to inference higher-quality representations is the important step that will open many new applications of ensembling.
arXiv Detail & Related papers (2021-06-15T10:49:46Z) - Ear2Face: Deep Biometric Modality Mapping [9.560980936110234]
We present an end-to-end deep neural network model that learns a mapping between the biometric modalities.
We formulated the problem as a paired image-to-image translation task and collected datasets of ear and face image pairs.
We have achieved very promising results, especially on the FERET dataset, generating visually appealing face images from ear image inputs.
arXiv Detail & Related papers (2020-06-02T21:14:27Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To keep training on this enlarged dataset manageable, we further propose to apply a dataset distillation strategy that compresses the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z) - Joint Deep Learning of Facial Expression Synthesis and Recognition [97.19528464266824]
We propose a novel joint deep learning of facial expression synthesis and recognition method for effective FER.
The proposed method involves a two-stage learning procedure. Firstly, a facial expression synthesis generative adversarial network (FESGAN) is pre-trained to generate facial images with different facial expressions.
In order to alleviate the problem of data bias between the real images and the synthetic images, we propose an intra-class loss with a novel real data-guided back-propagation (RDBP) algorithm.
arXiv Detail & Related papers (2020-02-06T10:56:00Z)