Towards Unified Facial Action Unit Recognition Framework by Large Language Models
- URL: http://arxiv.org/abs/2409.08444v1
- Date: Fri, 13 Sep 2024 00:26:09 GMT
- Title: Towards Unified Facial Action Unit Recognition Framework by Large Language Models
- Authors: Guohong Hu, Xing Lan, Hanyu Jiang, Jiayi Lyu, Jian Xue
- Abstract summary: We propose AU-LLaVA, the first unified AU recognition framework based on a Large Language Model (LLM).
AU-LLaVA consists of a visual encoder, a linear projector layer, and a pre-trained LLM.
On the BP4D and DISFA datasets, AU-LLaVA delivers the most accurate recognition results for nearly half of the AUs.
- Score: 10.752099675130276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Facial Action Units (AUs) are of great significance in the realm of affective computing. In this paper, we propose AU-LLaVA, the first unified AU recognition framework based on a Large Language Model (LLM). AU-LLaVA consists of a visual encoder, a linear projector layer, and a pre-trained LLM. We meticulously craft the text descriptions and fine-tune the model on various AU datasets, allowing it to generate different formats of AU recognition results for the same input image. On the BP4D and DISFA datasets, AU-LLaVA delivers the most accurate recognition results for nearly half of the AUs. Our model achieves F1-score improvements of up to 11.4% for specific AUs compared to previous benchmark results. On the FEAFA dataset, our method achieves significant improvements across all 24 AUs compared to previous benchmark results. AU-LLaVA demonstrates exceptional performance and versatility in AU recognition.
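The pipeline named in the abstract (a visual encoder feeding a linear projector, whose output tokens are consumed by a pre-trained LLM that emits AU results as text) can be sketched roughly as below. This is a minimal illustration only: the backbone, LLM, dimensions, and prompt handling are stand-in assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn


class AULLaVASketch(nn.Module):
    """Toy stand-in for the three components named in the abstract."""

    def __init__(self, vision_dim=768, llm_dim=1024, vocab_size=32000):
        super().__init__()
        # Stand-in for the visual encoder: a patchify convolution instead of a
        # real pre-trained backbone (the actual backbone is not specified here).
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, vision_dim, kernel_size=14, stride=14),
            nn.Flatten(start_dim=2),  # (B, vision_dim, num_patches)
        )
        # The linear projector layer mapping visual features into the LLM's
        # embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stand-in for the pre-trained LLM: a single Transformer layer.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=1,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image, prompt_embeds):
        # Encode the face image into a sequence of visual tokens.
        vis = self.visual_encoder(image).transpose(1, 2)  # (B, num_patches, vision_dim)
        vis = self.projector(vis)                         # (B, num_patches, llm_dim)
        # Prepend projected visual tokens to the text-prompt embeddings and let
        # the LLM produce next-token logits, i.e. AU results as free-form text.
        seq = torch.cat([vis, prompt_embeds], dim=1)
        return self.lm_head(self.llm(seq))


if __name__ == "__main__":
    model = AULLaVASketch()
    image = torch.randn(1, 3, 224, 224)    # 224 / 14 = 16 -> 256 visual tokens
    prompt = torch.randn(1, 12, 1024)      # placeholder prompt embeddings
    print(model(image, prompt).shape)      # torch.Size([1, 268, 32000])
```

A real instantiation would load pre-trained weights for the encoder and LLM and fine-tune on the AU datasets with the crafted text descriptions, as described in the abstract; the sketch keeps everything random and small so it runs standalone.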
Related papers
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a mixture-of-experts design for audiovisual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z) - Towards End-to-End Explainable Facial Action Unit Recognition via Vision-Language Joint Learning [48.70249675019288]
We propose an end-to-end Vision-Language joint learning network for explainable facial action units (AUs) recognition.
The proposed approach achieves superior performance over the state-of-the-art methods on most metrics.
arXiv Detail & Related papers (2024-08-01T15:35:44Z) - Representation Learning and Identity Adversarial Training for Facial Behavior Understanding [3.350769246260559]
We show that subject identity provides a learning shortcut for the model and leads to sub-optimal AU predictions.
We propose Identity Adversarial Training (IAT) and demonstrate that a strong IAT regularization is necessary to learn identity-invariant features.
Our proposed methods, Facial Masked Autoencoder (FMAE) and IAT, are simple, generic and effective.
arXiv Detail & Related papers (2024-07-15T21:13:28Z) - Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z) - Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision-language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively.
arXiv Detail & Related papers (2023-12-17T09:44:27Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
We further elaborate the robustness metric: a model is judged to be robust if its performance is consistently accurate across the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z) - AU-Supervised Convolutional Vision Transformers for Synthetic Facial Expression Recognition [12.661683851729679]
The paper describes our proposed methodology for the six basic expression classification track of Affective Behavior Analysis in-the-wild (ABAW) Competition 2022.
Because of the ambiguity of the synthetic data and the objectivity of facial Action Units (AUs), we resort to AU information to boost performance.
arXiv Detail & Related papers (2022-07-20T09:33:39Z) - Learning Multi-dimensional Edge Feature-based AU Relation Graph for Facial Action Unit Recognition [27.34564955127377]
The activations of Facial Action Units (AUs) mutually influence one another.
Existing approaches fail to specifically and explicitly represent such cues for each pair of AUs in each facial display.
This paper proposes an AU relationship modelling approach that uses deep learning to learn a unique graph explicitly describing the relationship between each pair of AUs.
arXiv Detail & Related papers (2022-05-02T03:38:00Z) - An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z) - AU-Guided Unsupervised Domain Adaptive Facial Expression Recognition [21.126514122636966]
This paper proposes an AU-guided unsupervised Domain Adaptive FER framework to relieve the annotation bias between different FER datasets.
To achieve domain-invariant compact features, we employ AU-guided triplet training, which randomly collects anchor-positive-negative triplets on both domains using AUs (a minimal sketch of the triplet objective follows this list).
arXiv Detail & Related papers (2020-12-18T07:17:30Z)
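To make the AU-guided triplet idea above concrete, here is a minimal sketch of one possible objective: samples from either domain sharing the same binary AU pattern serve as positives, samples with a different pattern as negatives, and a standard margin loss pulls and pushes the embeddings accordingly. The feature source, sampling strategy, and dimensions are illustrative assumptions, not the cited paper's implementation.

```python
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)


def au_guided_triplet_loss(features, au_labels):
    """features: (N, D) embeddings pooled from both domains.
    au_labels: (N, K) binary AU activations used to pick triplets."""
    anchors, positives, negatives = [], [], []
    for i in range(features.size(0)):
        match = (au_labels == au_labels[i]).all(dim=1)  # same AU pattern as sample i
        same, diff = match.clone(), ~match
        same[i] = False                                  # exclude the anchor itself
        if same.any() and diff.any():
            anchors.append(features[i])
            positives.append(features[same.nonzero()[0, 0]])  # first matching sample
            negatives.append(features[diff.nonzero()[0, 0]])  # first differing sample
    if not anchors:
        return features.new_zeros(())  # no valid triplet in this batch
    return triplet_loss(torch.stack(anchors),
                        torch.stack(positives),
                        torch.stack(negatives))


# Example with random embeddings and binary AU labels pooled from two domains.
feats = torch.randn(8, 128)
aus = torch.randint(0, 2, (8, 4)).float()
print(au_guided_triplet_loss(feats, aus))
```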