Prompting Visual-Language Models for Dynamic Facial Expression
Recognition
- URL: http://arxiv.org/abs/2308.13382v2
- Date: Sat, 14 Oct 2023 23:20:04 GMT
- Title: Prompting Visual-Language Models for Dynamic Facial Expression
Recognition
- Authors: Zengqun Zhao, Ioannis Patras
- Abstract summary: This paper presents a novel visual-language model called DFER-CLIP.
It is based on the CLIP model and designed for in-the-wild Dynamic Facial Expression Recognition.
It achieves state-of-the-art results compared with the current supervised DFER methods on the DFEW, FERV39k, and MAFW benchmarks.
- Score: 14.783257517376041
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a novel visual-language model called DFER-CLIP, which is
based on the CLIP model and designed for in-the-wild Dynamic Facial Expression
Recognition (DFER). Specifically, the proposed DFER-CLIP consists of a visual
part and a textual part. For the visual part, based on the CLIP image encoder,
a temporal model consisting of several Transformer encoders is introduced for
extracting temporal facial expression features, and the final feature embedding
is obtained as a learnable "class" token. For the textual part, we use as
inputs textual descriptions of the facial behaviour that is related to the
classes (facial expressions) that we are interested in recognising -- those
descriptions are generated using large language models, like ChatGPT. This, in
contrast to works that use only the class names, more accurately captures the
relationship between them. Alongside the textual description, we introduce
a learnable token which helps the model learn relevant context information for
each expression during training. Extensive experiments demonstrate the
effectiveness of the proposed method and show that our DFER-CLIP also achieves
state-of-the-art results compared with the current supervised DFER methods on
the DFEW, FERV39k, and MAFW benchmarks. Code is publicly available at
https://github.com/zengqunzhao/DFER-CLIP.
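For orientation, here is a minimal PyTorch sketch of the pipeline the abstract describes: per-frame CLIP image features are passed through a small temporal Transformer whose learnable "class" token yields the video embedding, which is then matched against per-class text embeddings (the CLIP text encoder applied to learnable context tokens prepended to an LLM-generated description of each expression). Module names, depths, and dimensions are illustrative assumptions, and the text embeddings are treated as precomputed; the authors' actual implementation is at the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalTransformer(nn.Module):
    """Several Transformer encoder layers over per-frame features; the video-level
    embedding is read out at a learnable "class" token, as in the abstract above."""
    def __init__(self, dim: int = 512, depth: int = 2, heads: int = 8, max_frames: int = 64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, max_frames + 1, dim))

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) per-frame embeddings from the (frozen) CLIP image encoder
        b, t, _ = frame_feats.shape
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, frame_feats], dim=1) + self.pos_embed[:, : t + 1]
        return self.encoder(x)[:, 0]          # output at the class-token position

def expression_logits(frame_feats: torch.Tensor, text_feats: torch.Tensor,
                      temporal: TemporalTransformer, temperature: float = 0.01) -> torch.Tensor:
    """Cosine-similarity classification against per-class text embeddings. The text
    embeddings are assumed to be precomputed by the CLIP text encoder from learnable
    context tokens plus an LLM-generated description of each expression."""
    video = F.normalize(temporal(frame_feats), dim=-1)    # (B, D)
    text = F.normalize(text_feats, dim=-1)                # (num_classes, D)
    return video @ text.T / temperature                   # (B, num_classes)

# Toy usage with random tensors standing in for CLIP features (7 expression classes).
temporal = TemporalTransformer(dim=512)
logits = expression_logits(torch.randn(4, 16, 512), torch.randn(7, 512), temporal)
print(logits.shape)  # torch.Size([4, 7])
```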
Related papers
- Scene Graph Generation with Role-Playing Large Language Models [50.252588437973245]
Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP.
We propose SDSGG, a scene-specific, description-based OVSGG framework.
To capture the complicated interplay between subjects and objects, we propose a new lightweight module called mutual visual adapter.
arXiv Detail & Related papers (2024-10-20T11:40:31Z)
- FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs [5.35588281968644]
We propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER).
Our FineCLIPER achieves state-of-the-art performance on the DFEW, FERV39k, and MAFW datasets with few tunable parameters.
arXiv Detail & Related papers (2024-07-02T10:55:43Z)
- Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting [28.673734895558322]
We introduce a challenging Open-set Video-based Facial Expression Recognition task, aiming to identify both known and new, unseen facial expressions.
Existing approaches use large-scale vision-language models like CLIP to identify unseen classes.
We propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP's ability to model video-based facial expression details.
arXiv Detail & Related papers (2024-04-26T01:21:08Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate natural-language descriptions of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
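As a rough illustration of the second stage, the sketch below fuses an image embedding with the embedding of a generated description in a small Transformer classifier; the encoders, dimensions, and class count are assumptions, and stage one (querying the VLLM) is represented only by a comment.

```python
import torch
import torch.nn as nn

class ContextFusionClassifier(nn.Module):
    """Stage two of the pipeline summarised above: a small Transformer attends jointly over
    an image embedding and the embedding of the VLLM-generated emotion description."""
    def __init__(self, dim: int = 512, num_classes: int = 8, depth: int = 2, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # image_emb, text_emb: (B, D), e.g. CLIP image features and an embedding of the
        # description produced in stage one (the VLLM call itself is omitted here).
        tokens = torch.stack([image_emb, text_emb], dim=1)                       # (B, 2, D)
        x = torch.cat([self.cls_token.expand(image_emb.size(0), -1, -1), tokens], dim=1)
        return self.head(self.encoder(x)[:, 0])

# Toy usage with random stand-ins for the two embeddings.
model = ContextFusionClassifier()
print(model(torch.randn(4, 512), torch.randn(4, 512)).shape)  # torch.Size([4, 8])
```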
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense predictions tasks like detection and segmentation.
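For reference, a minimal sketch of the exponential-moving-average teacher update used in such self-distillation schemes is shown below; the momentum value is illustrative and the local-to-global correspondence loss itself is not shown.

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.996) -> None:
    """Update the teacher as an exponential moving average of the student's weights
    (the momentum value here is illustrative, not taken from the paper)."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

# Toy usage: the teacher starts as a copy of the student and is never updated by gradients.
student = torch.nn.Linear(16, 16)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
ema_update(teacher, student)
```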
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings of each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image.
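A rough sketch of that recipe, with random arrays standing in for the CLIP embeddings, might look as follows; the exact featurisation and regularisation used in the paper may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-ins for L2-normalised CLIP embeddings of N images and M LLM-generated
# visual descriptions (several descriptions per class); 5 classes are assumed here.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(200, 512))
image_feats /= np.linalg.norm(image_feats, axis=1, keepdims=True)
desc_feats = rng.normal(size=(50, 512))
desc_feats /= np.linalg.norm(desc_feats, axis=1, keepdims=True)
labels = rng.integers(0, 5, size=200)

# Each image is represented by its similarity to every description ("augmented visual descriptors").
features = image_feats @ desc_feats.T                      # (N, M)

# Sparse (L1-penalised) logistic regression selects a relevant subset of descriptors per class.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(features, labels)
print((np.abs(clf.coef_) > 1e-6).sum(axis=1))              # descriptors kept for each class
```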
arXiv Detail & Related papers (2023-07-10T03:06:45Z)
- ClipSitu: Effectively Leveraging CLIP for Conditional Predictions in Situation Recognition [20.000253437661]
Situation Recognition is the task of generating a structured summary of what is happening in an image using an activity verb.
We leverage the CLIP foundational model that has learned the context of images via language descriptions.
Our cross-attention-based Transformer known as ClipSitu XTF outperforms existing state-of-the-art by a large margin of 14.1% on semantic role labelling.
arXiv Detail & Related papers (2023-07-02T15:05:15Z)
- DisCLIP: Open-Vocabulary Referring Expression Generation [37.789850573203694]
We build on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a contextual description of a target concept in an image.
We measure the quality of the generated text by evaluating the capability of a receiver model to accurately identify the described object within the scene.
Our results highlight the potential of using pre-trained visual-semantic models for generating high-quality contextual descriptions.
arXiv Detail & Related papers (2023-05-30T15:13:17Z)
- CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.
We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 11 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z)
- CLIPER: A Unified Vision-Language Framework for In-the-Wild Facial Expression Recognition [1.8604727699812171]
We propose a unified framework for both static and dynamic facial expression recognition based on CLIP.
We introduce multiple expression text descriptors (METD) to learn fine-grained expression representations that make CLIPER more interpretable.
arXiv Detail & Related papers (2023-03-01T02:59:55Z)