CLIPER: A Unified Vision-Language Framework for In-the-Wild Facial
Expression Recognition
- URL: http://arxiv.org/abs/2303.00193v1
- Date: Wed, 1 Mar 2023 02:59:55 GMT
- Title: CLIPER: A Unified Vision-Language Framework for In-the-Wild Facial
Expression Recognition
- Authors: Hanting Li, Hongjing Niu, Zhaoqing Zhu, and Feng Zhao
- Abstract summary: We propose a unified framework for both static and dynamic facial expression recognition based on CLIP.
We introduce multiple expression text descriptors (METD) to learn fine-grained expression representations that make CLIPER more interpretable.
- Score: 1.8604727699812171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Facial expression recognition (FER) is an essential task for understanding
human behaviors. As one of the most informative behaviors of humans, facial
expressions are often compound and variable, which is manifested by the fact
that different people may express the same expression in very different ways.
However, most FER methods still use one-hot or soft labels as the supervision,
which lack sufficient semantic descriptions of facial expressions and are less
interpretable. Recently, contrastive vision-language pre-training (VLP) models
(e.g., CLIP) use text as supervision and have injected new vitality into
various computer vision tasks, benefiting from the rich semantics in text.
Therefore, in this work, we propose CLIPER, a unified framework for both static
and dynamic facial expression recognition based on CLIP. In addition, we introduce
multiple expression text descriptors (METD) to learn fine-grained expression
representations that make CLIPER more interpretable. We conduct extensive
experiments on several popular FER benchmarks and achieve state-of-the-art
performance, which demonstrates the effectiveness of CLIPER.
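The core mechanism the abstract describes, matching an image embedding against multiple text descriptors per expression class, can be sketched in plain Python. This is a hypothetical illustration of the METD idea, not CLIPER's implementation: the toy 3-d vectors stand in for CLIP encoder outputs, and the descriptor phrases in the comments are invented examples.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify(image_emb, class_descriptors):
    """Score each class by averaging similarity over its text
    descriptors (the multiple-descriptors-per-class idea), then
    normalize with a softmax."""
    scores = {
        label: sum(cosine(image_emb, d) for d in descs) / len(descs)
        for label, descs in class_descriptors.items()
    }
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

# Toy embeddings standing in for CLIP text-encoder outputs of
# hypothetical fine-grained descriptors per expression class.
descriptors = {
    "happy": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],  # e.g. "raised lip corners"
    "sad":   [[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]],  # e.g. "lowered brows"
}
image = [0.85, 0.15, 0.05]  # toy image embedding
probs = classify(image, descriptors)
print(max(probs, key=probs.get))
```

Averaging over several descriptors per class, rather than using a single class-name prompt, is what lets the per-descriptor similarities serve as an interpretable breakdown of the prediction.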
Related papers
- Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting [28.673734895558322]
We introduce a challenging Open-set Video-based Facial Expression Recognition task, aiming at identifying unknown human facial expressions.
Existing approaches address open-set recognition by leveraging large-scale vision-language models like CLIP to identify unseen classes.
We propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP's ability to model video-based facial expression details effectively.
arXiv Detail & Related papers (2024-04-26T01:21:08Z) - Contrastive Learning of View-Invariant Representations for Facial
Expressions Recognition [27.75143621836449]
We propose ViewFX, a novel view-invariant FER framework based on contrastive learning.
We test the proposed framework on two public multi-view facial expression recognition datasets.
arXiv Detail & Related papers (2023-11-12T14:05:09Z) - GaFET: Learning Geometry-aware Facial Expression Translation from
In-The-Wild Images [55.431697263581626]
We introduce a novel Geometry-aware Facial Expression Translation framework, which is based on parametric 3D facial representations and can stably decouple expressions.
We achieve higher-quality and more accurate facial expression transfer results compared to state-of-the-art methods, and demonstrate its applicability across various poses and complex textures.
arXiv Detail & Related papers (2023-08-07T09:03:35Z) - CLIP also Understands Text: Prompting CLIP for Phrase Understanding [65.59857372525664]
Contrastive Language-Image Pretraining (CLIP) efficiently learns visual concepts by pre-training with natural language supervision.
In this paper, we find that the text encoder of CLIP actually demonstrates strong ability for phrase understanding, and can even significantly outperform popular language models such as BERT with a properly designed prompt.
arXiv Detail & Related papers (2022-10-11T23:35:18Z) - MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image
Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z) - CIAO! A Contrastive Adaptation Mechanism for Non-Universal Facial
Expression Recognition [80.07590100872548]
We propose Contrastive Inhibitory Adaptation (CIAO), a mechanism that adapts the last layer of facial encoders to depict specific affective characteristics on different datasets.
CIAO presents an improvement in facial expression recognition performance on six different datasets with distinct affective representations.
arXiv Detail & Related papers (2022-08-10T15:46:05Z) - Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z) - Self-supervised Contrastive Learning of Multi-view Facial Expressions [9.949781365631557]
Facial expression recognition (FER) has emerged as an important component of human-computer interaction systems.
We propose Contrastive Learning of Multi-view facial Expressions (CL-MEx) to exploit facial images captured simultaneously from different angles towards FER.
arXiv Detail & Related papers (2021-08-15T11:23:34Z) - A Multi-resolution Approach to Expression Recognition in the Wild [9.118706387430883]
We propose a multi-resolution approach to solve the Facial Expression Recognition task.
We ground our intuition on the observation that face images are often acquired at different resolutions.
To this end, we use a ResNet-like architecture, equipped with Squeeze-and-Excitation blocks, trained on the Affect-in-the-Wild 2 dataset.
arXiv Detail & Related papers (2021-03-09T21:21:02Z) - Learning to Augment Expressions for Few-shot Fine-grained Facial
Expression Recognition [98.83578105374535]
We present a novel Fine-grained Facial Expression Database - F2ED.
It includes more than 200k images with 54 facial expressions from 119 persons.
Considering that uneven data distribution and a lack of samples are common in real-world scenarios, we evaluate several few-shot expression learning tasks.
We propose a unified task-driven framework - Compositional Generative Adversarial Network (Comp-GAN) learning to synthesize facial images.
arXiv Detail & Related papers (2020-01-17T03:26:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.