Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant
Features
- URL: http://arxiv.org/abs/2312.05265v1
- Date: Wed, 6 Dec 2023 08:58:11 GMT
- Title: Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant
Features
- Authors: Anderson Augusma (M-PSI, SVH), Dominique Vaufreydaz (M-PSI), Frédérique Letué (SVH)
- Abstract summary: Group-level emotion recognition can be useful in many fields including social robotics, conversational agents, e-coaching and learning analytics.
This paper explores privacy-compliant group-level emotion recognition "in-the-wild" within the EmotiW Challenge 2023.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores privacy-compliant group-level emotion recognition
"in-the-wild" within the EmotiW Challenge 2023. Group-level emotion
recognition can be useful in many fields including social robotics,
conversational agents, e-coaching and learning analytics. This research
restricts itself to global features only, avoiding individual ones, i.e., any
feature that could be used to identify or track people in videos (facial
landmarks, body poses, audio diarization, etc.). The proposed multimodal model
is composed of a video branch and an audio branch with cross-attention between
modalities. The video branch is based on a fine-tuned ViT architecture. The
audio branch extracts Mel-spectrograms and feeds them through CNN blocks into a
transformer encoder. Our training paradigm includes a generated synthetic
dataset to increase the model's sensitivity to facial expressions within the
image in a data-driven way. Extensive experiments show the significance of our
methodology. Our privacy-compliant proposal performs fairly well on the EmotiW
challenge, with 79.24% and 75.13% accuracy on the validation and test sets,
respectively, for the best models. Notably, our findings highlight that it is
possible to reach this accuracy level with privacy-compliant features using
only 5 frames uniformly distributed across the video.
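The abstract outlines a two-branch design: a fine-tuned ViT video branch over 5 uniformly sampled frames, an audio branch that pushes Mel-spectrograms through CNN blocks into a transformer encoder, and cross-attention fusion between modalities. The following is a minimal PyTorch sketch of that structure, not the authors' code: module names, dimensions, and the 3-class head (e.g. negative/neutral/positive) are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' implementation) of the described
# two-branch group-emotion model: ViT frame features + Mel-spectrogram audio branch,
# fused with cross-attention.
import torch
import torch.nn as nn


class AudioBranch(nn.Module):
    """Mel-spectrogram -> CNN blocks -> transformer encoder, as described in the abstract."""
    def __init__(self, d_model=256):
        super().__init__()
        self.cnn = nn.Sequential(                       # CNN blocks over the spectrogram
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 16)),              # keep 16 time steps as tokens
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, mel):                             # mel: (B, 1, n_mels, time)
        tokens = self.cnn(mel).squeeze(2).transpose(1, 2)    # (B, 16, d_model)
        return self.encoder(tokens)


class GroupEmotionModel(nn.Module):
    """Video branch (fine-tuned ViT, stubbed as precomputed features) fused with audio."""
    def __init__(self, d_model=256, num_classes=3):
        super().__init__()
        self.video_proj = nn.Linear(768, d_model)       # placeholder for ViT frame features
        self.audio = AudioBranch(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, num_classes)     # assumed 3 group-emotion classes

    def forward(self, vit_features, mel):               # vit_features: (B, 5, 768), 5 sampled frames
        v = self.video_proj(vit_features)
        a = self.audio(mel)
        fused, _ = self.cross_attn(query=v, key=a, value=a)  # video tokens attend to audio
        return self.head(fused.mean(dim=1))              # pool over the 5 frames


# Toy forward pass: 5 uniformly sampled frames encoded as 768-d ViT features, one spectrogram.
model = GroupEmotionModel()
logits = model(torch.randn(2, 5, 768), torch.randn(2, 1, 64, 400))
print(logits.shape)  # torch.Size([2, 3])
```

Feeding only 5 uniformly spaced frames per video, as the abstract reports, keeps the video branch cheap while cross-attention lets those few frames borrow context from the full audio track.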
Related papers
- A Multimodal Framework for Deepfake Detection [0.0]
Deepfakes, synthetic media created using AI, can convincingly alter videos and audio to misrepresent reality.
Our research addresses the critical issue of deepfakes through an innovative multimodal approach.
Our framework combines visual and auditory analyses, yielding an accuracy of 94%.
arXiv Detail & Related papers (2024-10-04T14:59:10Z)
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- Affective Behaviour Analysis via Integrating Multi-Modal Knowledge [24.74463315135503]
The 6th competition on Affective Behavior Analysis in-the-wild (ABAW) utilizes the Aff-Wild2, Hume-Vidmimic2, and C-EXPR-DB datasets.
We present our method designs for the five competitive tracks, i.e., Valence-Arousal (VA) Estimation, Expression (EXPR) Recognition, Action Unit (AU) Detection, Compound Expression (CE) Recognition, and Emotional Mimicry Intensity (EMI) Estimation.
arXiv Detail & Related papers (2024-03-16T06:26:43Z)
- AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos only utilizes the visual modality or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos [51.409547544747284]
NPF-200 is the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations.
We conduct a series of analyses to gain deeper insights into this task.
We propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet.
arXiv Detail & Related papers (2023-08-23T14:25:22Z)
- Mutilmodal Feature Extraction and Attention-based Fusion for Emotion Estimation in Videos [16.28109151595872]
We introduce our submission to the CVPR 2023 Competition on Affective Behavior Analysis in-the-wild (ABAW).
We exploit multimodal features extracted from videos of different lengths in the competition dataset, including audio, pose and images.
Our system achieves the performance of 0.361 on the validation dataset.
arXiv Detail & Related papers (2023-03-18T14:08:06Z)
- Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
- Audio-Visual Fusion Layers for Event Type Aware Video Recognition [86.22811405685681]
We propose a new model to address the multisensory integration problem with individual event-specific layers in a multi-task learning scheme.
We show that although our network is formulated with single labels, it can output additional true multi-labels to represent the given videos.
arXiv Detail & Related papers (2022-02-12T02:56:22Z)
- Automated Video Labelling: Identifying Faces by Corroborative Evidence [79.44208317138784]
We present a method for automatically labelling all faces in video archives, such as TV broadcasts, by combining multiple evidence sources and multiple modalities.
We provide a novel, simple, method for determining if a person is famous or not using image-search engines.
We show that even for less-famous people, image-search engines can be used for corroborative evidence to accurately label faces that are named in the scene or the speech.
arXiv Detail & Related papers (2021-02-10T18:57:52Z)
- Group-Level Emotion Recognition Using a Unimodal Privacy-Safe Non-Individual Approach [0.0]
This article presents our unimodal privacy-safe and non-individual proposal for the audio-video group emotion recognition subtask at the Emotion Recognition in the Wild (EmotiW) Challenge 2020.
arXiv Detail & Related papers (2020-09-15T12:25:33Z)