Transformer-based Multimodal Information Fusion for Facial Expression
Analysis
- URL: http://arxiv.org/abs/2203.12367v1
- Date: Wed, 23 Mar 2022 12:38:50 GMT
- Title: Transformer-based Multimodal Information Fusion for Facial Expression
Analysis
- Authors: Wei Zhang, Zhimeng Zhang, Feng Qiu, Suzhen Wang, Bowen Ma, Hao Zeng,
Rudong An, Yu Ding
- Abstract summary: We introduce our submission to the CVPR2022 Competition on Affective Behavior Analysis in-the-wild (ABAW), which defines four competition tasks.
The available multimodal information consists of spoken words, speech prosody, and visual expressions in videos.
Our work proposes four unified transformer-based network frameworks to fuse the above multimodal information.
- Score: 10.548915939047305
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Facial expression analysis has been a crucial research problem in the
computer vision area. With the recent development of deep learning techniques
and large-scale in-the-wild annotated datasets, facial expression analysis is
now aimed at challenges in real-world settings. In this paper, we introduce our
submission to the CVPR2022 Competition on Affective Behavior Analysis in-the-wild
(ABAW), which defines four competition tasks: expression classification, action
unit detection, valence-arousal estimation, and multi-task learning. The
available multimodal information consists of spoken words, speech prosody, and
visual expressions in videos. Our work proposes four unified transformer-based
network frameworks to fuse the above multimodal information. Preliminary results
on the official Aff-Wild2 dataset are reported and demonstrate the effectiveness
of our proposed method.
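
As an illustration of the kind of fusion the abstract describes, the following is a minimal PyTorch sketch of a transformer-based multimodal fusion network: per-modality feature sequences (spoken words, speech prosody, visual features) are projected to a shared dimension, tagged with learnable modality embeddings, concatenated, and passed through a transformer encoder. All module names, dimensions, and the single classification head are illustrative assumptions, not the authors' implementation, which uses four unified frameworks covering all ABAW tasks.

```python
# Hypothetical sketch of a transformer-based multimodal fusion network.
# Dimensions, module names, and the task head are assumptions for illustration;
# the paper's actual architecture may differ.
import torch
import torch.nn as nn


class MultimodalFusionTransformer(nn.Module):
    def __init__(self, d_word=768, d_audio=128, d_visual=512,
                 d_model=256, n_heads=4, n_layers=4, n_classes=8):
        super().__init__()
        # Project each modality's features into a shared embedding space.
        self.word_proj = nn.Linear(d_word, d_model)
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.visual_proj = nn.Linear(d_visual, d_model)
        # Learnable modality embeddings distinguish the three streams.
        self.modality_emb = nn.Parameter(torch.zeros(3, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Example task head (expression classification); the competition also
        # covers AU detection, valence-arousal estimation, and multi-task learning.
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, words, audio, visual):
        # words:  (B, Tw, d_word)   pre-extracted text features
        # audio:  (B, Ta, d_audio)  speech prosody features
        # visual: (B, Tv, d_visual) per-frame visual features
        tokens = torch.cat([
            self.word_proj(words) + self.modality_emb[0],
            self.audio_proj(audio) + self.modality_emb[1],
            self.visual_proj(visual) + self.modality_emb[2],
        ], dim=1)
        fused = self.encoder(tokens)          # (B, Tw+Ta+Tv, d_model)
        return self.head(fused.mean(dim=1))   # pooled prediction per clip


# Usage with random tensors standing in for real modality features.
model = MultimodalFusionTransformer()
logits = model(torch.randn(2, 20, 768), torch.randn(2, 50, 128),
               torch.randn(2, 16, 512))
print(logits.shape)  # torch.Size([2, 8])
```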
Related papers
- Visual In-Context Learning for Large Vision-Language Models [62.5507897575317]
In Large Vision-Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities.
We introduce a novel Visual In-Context Learning (VICL) method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition.
Our approach retrieves images via a "Retrieval & Rerank" paradigm, summarises images with task intent and task-specific visual parsing, and composes language-based demonstrations.
arXiv Detail & Related papers (2024-02-18T12:43:38Z) - Gemini vs GPT-4V: A Preliminary Comparison and Combination of
Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z) - Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text.
A Vision-Language-Consistency Analysis of VLLMs and Beyond [7.760124498553333]
We study whether vision-language models execute vision and language tasks consistently or independently.
We introduce a systematic framework that quantifies the capability disparities between different modalities in the multi-modal setting.
We introduce "Vision Description Prompting," a method that effectively improves performance in challenging vision-related tasks.
arXiv Detail & Related papers (2023-10-19T06:45:11Z) - Vision-Language Models for Vision Tasks: A Survey [62.543250338410836]
Vision-Language Models (VLMs) learn rich vision-language correlation from web-scale image-text pairs.
This paper provides a systematic review of visual language models for various visual recognition tasks.
arXiv Detail & Related papers (2023-04-03T02:17:05Z) - Multi Modal Facial Expression Recognition with Transformer-Based Fusion
Networks and Dynamic Sampling [1.983814021949464]
We introduce a Modal Fusion Module (MFM) to fuse audio-visual information, where image and audio features are extracted with a Swin Transformer.
Our model has been evaluated in the Affective Behavior Analysis in-the-wild (ABAW) challenge at CVPR 2023.
arXiv Detail & Related papers (2023-03-15T07:40:28Z) - Multimodal Lecture Presentations Dataset: Understanding Multimodality in
Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language, for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z) - An Ensemble Approach for Multiple Emotion Descriptors Estimation Using
Multi-task Learning [12.589338141771385]
This paper illustrates our submission method to the fourth Affective Behavior Analysis in-the-Wild (ABAW) Competition.
Instead of using only face information, we employ the full information from the provided dataset, which contains both the face and the context around it.
The proposed system achieves the performance of 0.917 on the MTL Challenge validation dataset.
arXiv Detail & Related papers (2022-07-22T04:57:56Z) - Facial Expression Recognition with Swin Transformer [1.983814021949464]
We introduce a Swin Transformer-based facial expression recognition approach for the in-the-wild audio-visual Aff-Wild2 Expression dataset.
Specifically, we employ a three-stream network to fuse the multi-modal information from the audio-visual videos for facial expression recognition.
arXiv Detail & Related papers (2022-03-25T06:42:31Z) - Prior Aided Streaming Network for Multi-task Affective Recognition at the
2nd ABAW2 Competition [9.188777864190204]
We introduce our submission to the 2nd Affective Behavior Analysis in-the-wild (ABAW2) Competition.
In dealing with different emotion representations, we propose a multi-task streaming network.
We leverage an advanced facial expression embedding as prior knowledge.
arXiv Detail & Related papers (2021-07-08T09:35:08Z) - Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework
of Vision-and-Language BERTs [57.74359320513427]
Methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI.
We study the differences between the two main categories of these models, and show how they can be unified under a single theoretical framework.
We conduct controlled experiments to discern the empirical differences between five V&L BERTs.
arXiv Detail & Related papers (2020-11-30T18:55:24Z) - Learning to Augment Expressions for Few-shot Fine-grained Facial
Expression Recognition [98.83578105374535]
We present a novel Fine-grained Facial Expression Database - F2ED.
It includes more than 200k images with 54 facial expressions from 119 persons.
Considering that uneven data distribution and a lack of samples are common in real-world scenarios, we evaluate several few-shot expression learning tasks.
We propose a unified task-driven framework - Compositional Generative Adversarial Network (Comp-GAN) learning to synthesize facial images.
arXiv Detail & Related papers (2020-01-17T03:26:32Z)