An audiovisual and contextual approach for categorical and continuous
emotion recognition in-the-wild
- URL: http://arxiv.org/abs/2107.03465v1
- Date: Wed, 7 Jul 2021 20:13:17 GMT
- Title: An audiovisual and contextual approach for categorical and continuous
emotion recognition in-the-wild
- Authors: Panagiotis Antoniadis, Ioannis Pikoulis, Panagiotis P. Filntisis,
Petros Maragos
- Abstract summary: We tackle the task of video-based audio-visual emotion recognition, within the premises of the 2nd Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW).
Standard methodologies that rely solely on the extraction of facial features often fall short of accurate emotion prediction in cases where the aforementioned source of affective information is inaccessible due to head/body orientation, low resolution and poor illumination.
We aspire to alleviate this problem by leveraging bodily as well as contextual features, as part of a broader emotion recognition framework.
- Score: 27.943550651941166
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work we tackle the task of video-based audio-visual emotion
recognition, within the premises of the 2nd Workshop and Competition on
Affective Behavior Analysis in-the-wild (ABAW). Standard methodologies that
rely solely on the extraction of facial features often fall short of accurate
emotion prediction in cases where the aforementioned source of affective
information is inaccessible due to head/body orientation, low resolution and
poor illumination. We aspire to alleviate this problem by leveraging bodily as
well as contextual features, as part of a broader emotion recognition
framework. A standard CNN-RNN cascade constitutes the backbone of our proposed
model for sequence-to-sequence (seq2seq) learning. Apart from learning through
the RGB input modality, we construct an aural stream which operates on
sequences of extracted mel-spectrograms. Our extensive experiments on the
challenging and newly assembled Affect-in-the-wild-2 (Aff-Wild2) dataset verify
the superiority of our methods over existing approaches, while by properly
incorporating all of the aforementioned modules in a network ensemble, we
surpass the previous best published recognition scores on the official
validation set. All the code was implemented using PyTorch
(https://pytorch.org/) and is publicly available at
https://github.com/PanosAntoniadis/NTUA-ABAW2021.
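To make the described pipeline concrete, the following is a minimal, illustrative PyTorch sketch of a two-stream CNN-RNN cascade in the spirit of the abstract: a visual stream over RGB frames and an aural stream over mel-spectrogram segments, fused for per-frame (seq2seq) emotion prediction. The backbone choices, layer sizes, 7-class output, torchvision/torchaudio usage, and the toy audio-to-frame alignment are all assumptions for illustration only; the authors' actual implementation is in the repository linked above.

```python
# Minimal two-stream CNN-RNN sketch (illustrative; NOT the authors' released code).
# Assumptions: ResNet-18 visual backbone (torchvision >= 0.13), a small CNN over
# 64-bin log-mel spectrograms, bidirectional GRUs, and 7 discrete emotion classes.
import torch
import torch.nn as nn
import torchaudio
from torchvision import models


class VisualStream(nn.Module):
    """CNN-RNN over a sequence of RGB frames: (B, T, 3, H, W) -> (B, T, 2*hidden)."""
    def __init__(self, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()           # keep 512-d frame embeddings
        self.cnn = backbone
        self.rnn = nn.GRU(512, hidden, batch_first=True, bidirectional=True)

    def forward(self, frames):
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.rnn(feats)
        return out


class AuralStream(nn.Module):
    """CNN-RNN over per-frame mel-spectrogram patches: (B, T, 1, n_mels, W) -> (B, T, 2*hidden)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rnn = nn.GRU(64, hidden, batch_first=True, bidirectional=True)

    def forward(self, mels):
        b, t = mels.shape[:2]
        feats = self.cnn(mels.flatten(0, 1)).view(b, t, -1)
        out, _ = self.rnn(feats)
        return out


class AudioVisualSeq2Seq(nn.Module):
    """Late fusion of the two streams with a per-timestep emotion classifier."""
    def __init__(self, num_classes=7, hidden=256):
        super().__init__()
        self.visual = VisualStream(hidden)
        self.aural = AuralStream(hidden)
        self.head = nn.Linear(4 * hidden, num_classes)

    def forward(self, frames, mels):
        fused = torch.cat([self.visual(frames), self.aural(mels)], dim=-1)
        return self.head(fused)               # (B, T, num_classes): one prediction per frame


if __name__ == "__main__":
    # Mel-spectrogram front end (assumed settings: 16 kHz audio, 64 mel bins).
    melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
    wave = torch.randn(1, 16000 * 2)                     # 2 s of dummy audio
    mel = torch.log(melspec(wave) + 1e-6)                # (1, 64, frames)
    # Slice the spectrogram into T chunks aligned with T video frames (toy alignment).
    t = 8
    chunks = mel[..., : (mel.shape[-1] // t) * t].chunk(t, dim=-1)
    mels = torch.stack(chunks, dim=1).unsqueeze(2)       # (1, T, 1, 64, W)
    frames = torch.randn(1, t, 3, 112, 112)              # T dummy RGB frames
    logits = AudioVisualSeq2Seq()(frames, mels)
    print(logits.shape)                                  # torch.Size([1, 8, 7])
```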
Related papers
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate natural-language descriptions of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching [74.75284453828017]
The Open-Vocabulary Keypoint Detection (OVKD) task is innovatively designed to use text prompts for identifying arbitrary keypoints across any species.
We have developed a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM).
This framework combines vision and language models, creating an interplay between language features and local keypoint visual features.
arXiv Detail & Related papers (2023-10-08T07:42:41Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z)
- Multimodal Emotion Recognition with High-level Speech and Text Features [8.141157362639182]
We propose a novel cross-representation speech model to perform emotion recognition on wav2vec 2.0 speech features.
We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem.
arXiv Detail & Related papers (2021-09-29T07:08:40Z)
- Cross-Domain First Person Audio-Visual Action Recognition through Relative Norm Alignment [15.545769463854915]
First person action recognition is an increasingly researched topic because of the growing popularity of wearable cameras.
This is bringing to light cross-domain issues that are yet to be addressed in this context.
We propose to leverage the intrinsic complementary nature of audio-visual signals to learn a representation that works well on data seen during training.
arXiv Detail & Related papers (2021-06-03T08:46:43Z)
- Leveraging Semantic Scene Characteristics and Multi-Stream Convolutional Architectures in a Contextual Approach for Video-Based Visual Emotion Recognition in the Wild [31.40575057347465]
We tackle the task of video-based visual emotion recognition in the wild.
Standard methodologies that rely solely on the extraction of bodily and facial features often fall short of accurate emotion prediction.
We aspire to alleviate this problem by leveraging visual context in the form of scene characteristics and attributes.
arXiv Detail & Related papers (2021-05-16T17:31:59Z)
- Generalizing Face Forgery Detection with High-frequency Features [63.33397573649408]
Current CNN-based detectors tend to overfit to method-specific color textures and thus fail to generalize.
We propose to utilize the high-frequency noises for face forgery detection.
The first is the multi-scale high-frequency feature extraction module that extracts high-frequency noises at multiple scales.
The second is the residual-guided spatial attention module that guides the low-level RGB feature extractor to concentrate more on forgery traces from a new perspective.
arXiv Detail & Related papers (2021-03-23T08:19:21Z)
- AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [55.24336227884039]
We present a novel framework to generate high-fidelity talking head video.
We use neural scene representation networks to bridge the gap between audio input and video output.
Our framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
arXiv Detail & Related papers (2021-03-20T02:58:13Z)
- A Self-Reasoning Framework for Anomaly Detection Using Video-Level Labels [17.615297975503648]
Anomalous event detection in surveillance videos is a challenging and practical research problem in the image and video processing community.
We propose a weakly supervised anomaly detection framework based on deep neural networks which is trained in a self-reasoning fashion using only video-level labels.
The proposed framework has been evaluated on publicly available real-world anomaly detection datasets including UCF-crime, ShanghaiTech and Ped2.
arXiv Detail & Related papers (2020-08-27T02:14:15Z)
- An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN (see the sketch after this list).
arXiv Detail & Related papers (2020-02-12T15:33:59Z)
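For the VAANet entry directly above, the following is a minimal, hypothetical sketch of temporal attention pooling over per-segment CNN features, in the spirit of the temporal attentions described there. It is not the VAANet implementation; the module name, feature dimension, and usage are illustrative assumptions.

```python
# Hypothetical temporal-attention pooling over per-segment features (B, T, D),
# in the spirit of the temporal attentions described for VAANet above.
import torch
import torch.nn as nn


class TemporalAttentionPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # one scalar score per time step

    def forward(self, feats):                 # feats: (B, T, D)
        weights = torch.softmax(self.score(feats), dim=1)  # normalized over time
        return (weights * feats).sum(dim=1)   # attention-weighted summary: (B, D)


if __name__ == "__main__":
    pooled = TemporalAttentionPool(512)(torch.randn(2, 16, 512))
    print(pooled.shape)                       # torch.Size([2, 512])
```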
This list is automatically generated from the titles and abstracts of the papers on this site.