CLIPER: A Unified Vision-Language Framework for In-the-Wild Facial
Expression Recognition
- URL: http://arxiv.org/abs/2303.00193v1
- Date: Wed, 1 Mar 2023 02:59:55 GMT
- Title: CLIPER: A Unified Vision-Language Framework for In-the-Wild Facial
Expression Recognition
- Authors: Hanting Li, Hongjing Niu, Zhaoqing Zhu, and Feng Zhao
- Abstract summary: We propose a unified framework for both static and dynamic facial expression recognition based on CLIP.
We introduce multiple expression text descriptors (METD) to learn fine-grained expression representations that make CLIPER more interpretable.
- Score: 1.8604727699812171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Facial expression recognition (FER) is an essential task for understanding
human behaviors. As one of the most informative behaviors of humans, facial
expressions are often compound and variable, which is manifested by the fact
that different people may express the same expression in very different ways.
However, most FER methods still use one-hot or soft labels as the supervision,
which lack sufficient semantic descriptions of facial expressions and are less
interpretable. Recently, contrastive vision-language pre-training (VLP) models
(e.g., CLIP) use text as supervision and have injected new vitality into
various computer vision tasks, benefiting from the rich semantics in text.
Therefore, in this work, we propose CLIPER, a unified framework for both static
and dynamic facial expression recognition based on CLIP. In addition, we introduce
multiple expression text descriptors (METD) to learn fine-grained expression
representations that make CLIPER more interpretable. We conduct extensive
experiments on several popular FER benchmarks and achieve state-of-the-art
performance, which demonstrates the effectiveness of CLIPER.
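The core mechanism the abstract describes, matching an image embedding against multiple text descriptors per expression class, can be sketched in plain Python. This is a hypothetical illustration of the METD idea, not CLIPER's implementation: the toy 3-d vectors stand in for CLIP encoder outputs, and the descriptor phrases in the comments are invented examples.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify(image_emb, class_descriptors):
    """Score each class by averaging similarity over its text
    descriptors (the multiple-descriptors-per-class idea), then
    normalize with a softmax."""
    scores = {
        label: sum(cosine(image_emb, d) for d in descs) / len(descs)
        for label, descs in class_descriptors.items()
    }
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

# Toy embeddings standing in for CLIP text-encoder outputs of
# hypothetical fine-grained descriptors per expression class.
descriptors = {
    "happy": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],  # e.g. "raised lip corners"
    "sad":   [[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]],  # e.g. "lowered brows"
}
image = [0.85, 0.15, 0.05]  # toy image embedding
probs = classify(image, descriptors)
print(max(probs, key=probs.get))
```

Averaging over several descriptors per class, rather than using a single class-name prompt, is what lets the per-descriptor similarities serve as an interpretable breakdown of the prediction.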
Related papers
- Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting [28.673734895558322]
We introduce a challenging Open-set Video-based Facial Expression Recognition task, aiming at identifying unknown human facial expressions.
Existing approaches address open-set recognition by leveraging large-scale vision-language models like CLIP to identify unseen classes.
We propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP's ability to model video-based facial expression details effectively.
arXiv Detail & Related papers (2024-04-26T01:21:08Z) - Contrastive Learning of View-Invariant Representations for Facial
Expressions Recognition [27.75143621836449]
We propose ViewFX, a novel view-invariant FER framework based on contrastive learning.
We test the proposed framework on two public multi-view facial expression recognition datasets.
arXiv Detail & Related papers (2023-11-12T14:05:09Z) - GaFET: Learning Geometry-aware Facial Expression Translation from
In-The-Wild Images [55.431697263581626]
We introduce a novel Geometry-aware Facial Expression Translation framework, which is based on parametric 3D facial representations and can stably decouple expressions.
We achieve higher-quality and more accurate facial expression transfer results compared to state-of-the-art methods, and demonstrate its applicability across various poses and complex textures.
arXiv Detail & Related papers (2023-08-07T09:03:35Z) - CLIP also Understands Text: Prompting CLIP for Phrase Understanding [65.59857372525664]
Contrastive Language-Image Pretraining (CLIP) efficiently learns visual concepts by pre-training with natural language supervision.
In this paper, we find that the text encoder of CLIP actually demonstrates strong ability for phrase understanding, and can even significantly outperform popular language models such as BERT with a properly designed prompt.
arXiv Detail & Related papers (2022-10-11T23:35:18Z) - MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image
Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z) - CIAO! A Contrastive Adaptation Mechanism for Non-Universal Facial
Expression Recognition [80.07590100872548]
We propose Contrastive Inhibitory Adaptation (CIAO), a mechanism that adapts the last layer of facial encoders to depict specific affective characteristics on different datasets.
CIAO presents an improvement in facial expression recognition performance on six different datasets with distinct affective representations.
arXiv Detail & Related papers (2022-08-10T15:46:05Z) - Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z) - Self-supervised Contrastive Learning of Multi-view Facial Expressions [9.949781365631557]
Facial expression recognition (FER) has emerged as an important component of human-computer interaction systems.
We propose Contrastive Learning of Multi-view facial Expressions (CL-MEx) to exploit facial images captured simultaneously from different angles towards FER.
arXiv Detail & Related papers (2021-08-15T11:23:34Z) - A Multi-resolution Approach to Expression Recognition in the Wild [9.118706387430883]
We propose a multi-resolution approach to solve the Facial Expression Recognition task.
We ground our intuition on the observation that face images are often acquired at different resolutions.
To this end, we use a ResNet-like architecture, equipped with Squeeze-and-Excitation blocks, trained on the Affect-in-the-Wild 2 dataset.
arXiv Detail & Related papers (2021-03-09T21:21:02Z) - Learning to Augment Expressions for Few-shot Fine-grained Facial
Expression Recognition [98.83578105374535]
We present a novel Fine-grained Facial Expression Database - F2ED.
It includes more than 200k images with 54 facial expressions from 119 persons.
Considering that uneven data distribution and a lack of samples are common in real-world scenarios, we evaluate several few-shot expression learning tasks.
We propose a unified task-driven framework - Compositional Generative Adversarial Network (Comp-GAN) learning to synthesize facial images.
arXiv Detail & Related papers (2020-01-17T03:26:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.