Leveraging Semantic Scene Characteristics and Multi-Stream Convolutional
Architectures in a Contextual Approach for Video-Based Visual Emotion
Recognition in the Wild
- URL: http://arxiv.org/abs/2105.07484v1
- Date: Sun, 16 May 2021 17:31:59 GMT
- Title: Leveraging Semantic Scene Characteristics and Multi-Stream Convolutional
Architectures in a Contextual Approach for Video-Based Visual Emotion
Recognition in the Wild
- Authors: Ioannis Pikoulis, Panagiotis P. Filntisis, Petros Maragos
- Abstract summary: We tackle the task of video-based visual emotion recognition in the wild.
Standard methodologies that rely solely on the extraction of bodily and facial features often fall short of accurate emotion prediction.
We aspire to alleviate this problem by leveraging visual context in the form of scene characteristics and attributes.
- Score: 31.40575057347465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work we tackle the task of video-based visual emotion recognition in
the wild. Standard methodologies that rely solely on the extraction of bodily
and facial features often fall short of accurate emotion prediction in cases
where the aforementioned sources of affective information are inaccessible due
to head/body orientation, low resolution and poor illumination. We aspire to
alleviate this problem by leveraging visual context in the form of scene
characteristics and attributes, as part of a broader emotion recognition
framework. Temporal Segment Networks (TSN) constitute the backbone of our
proposed model. Apart from the RGB input modality, we make use of dense Optical
Flow, following an intuitive multi-stream approach for a more effective
encoding of motion. Furthermore, we shift our attention towards skeleton-based
learning and leverage action-centric data as means of pre-training a
Spatial-Temporal Graph Convolutional Network (ST-GCN) for the task of emotion
recognition. Our extensive experiments on the challenging Body Language Dataset
(BoLD) verify the superiority of our methods over existing approaches; by
properly incorporating all of the aforementioned modules in a network ensemble,
we surpass the previous best published recognition scores by a large margin.
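Concretely, the described pipeline reduces to segment-level consensus within each stream and weighted late fusion across streams. Below is a minimal PyTorch sketch of that structure under stated assumptions: the backbones, feature dimension, and fusion weights are illustrative placeholders, not the authors' released code.

```python
import torch
import torch.nn as nn

class StreamTSN(nn.Module):
    """TSN-style stream: score K sampled snippets, then average (consensus)."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                       # per-snippet feature extractor
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, snippets: torch.Tensor):         # (B, K, C, H, W)
        b, k = snippets.shape[:2]
        feats = self.backbone(snippets.flatten(0, 1))  # (B*K, feat_dim)
        scores = self.head(feats).view(b, k, -1)       # (B, K, num_classes)
        return scores.mean(dim=1)                      # segmental consensus

def fuse_streams(stream_scores, weights):
    """Late fusion: weighted average of per-stream class scores."""
    stacked = torch.stack(stream_scores)               # (S, B, num_classes)
    w = torch.tensor(weights, dtype=stacked.dtype).view(-1, 1, 1)
    return (w * stacked).sum(0) / w.sum()

# e.g. fuse_streams([rgb_scores, flow_scores, stgcn_scores], [1.0, 1.0, 0.5])
```

The RGB and Optical Flow streams would each be a StreamTSN, while the ST-GCN branch produces its own class scores from skeleton input; all are combined only at the score level.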
Related papers
- Apprenticeship-Inspired Elegance: Synergistic Knowledge Distillation Empowers Spiking Neural Networks for Efficient Single-Eye Emotion Recognition [53.359383163184425]
We introduce a novel multimodality synergistic knowledge distillation scheme tailored for efficient single-eye emotion recognition tasks.
This method allows a lightweight, unimodal student spiking neural network (SNN) to extract rich knowledge from an event-frame multimodal teacher network.
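As a rough illustration of the distillation idea (not the paper's exact scheme, which involves spiking networks and multimodal teachers), a standard logit-matching knowledge-distillation loss looks like this; the temperature and mixing weight are conventional placeholder values:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with softened teacher matching (generic KD)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T                                  # rescale soft-target gradients
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```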
arXiv Detail & Related papers (2024-06-20T07:24:47Z)
- Self-supervised Gait-based Emotion Representation Learning from Selective Strongly Augmented Skeleton Sequences [4.740624855896404]
We propose a contrastive learning framework utilizing selective strong augmentation for self-supervised gait-based emotion representation.
Our approach is validated on the Emotion-Gait (E-Gait) and Emilya datasets and outperforms the state-of-the-art methods under different evaluation protocols.
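The contrastive objective behind such frameworks is typically an NT-Xent loss over two augmented views of each skeleton sequence; the sketch below is the generic form, not the paper's code, and the temperature is a placeholder:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    """NT-Xent contrastive loss between two augmented views (B, D) of a batch."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2B, D)
    sim = z @ z.t() / tau                         # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))             # exclude self-pairs
    b = z1.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)          # positives: the other view
```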
arXiv Detail & Related papers (2024-05-08T09:13:10Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate natural-language descriptions of the subject's apparent emotion.
In the second stage, the descriptions serve as contextual information and, together with the image input, are used to train a transformer-based architecture.
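A minimal sketch of this two-stage design, assuming hypothetical placeholders: vlm_describe stands in for whatever VLLM interface produces the stage-one description, and the fusion transformer's sizes are illustrative:

```python
import torch
import torch.nn as nn

def vlm_describe(image) -> str:
    """Stage 1 placeholder: query a vision-language model for the apparent emotion."""
    raise NotImplementedError("call your VLLM of choice here")

class TwoStageClassifier(nn.Module):
    """Stage 2: fuse text-description and image features in a small transformer."""
    def __init__(self, img_dim, txt_dim, num_classes, d_model=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, img_feat, txt_feat):        # (B, img_dim), (B, txt_dim)
        tokens = torch.stack(
            [self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=1)  # (B, 2, d)
        return self.head(self.encoder(tokens).mean(dim=1))
```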
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning [55.517000360348725]
This work presents a framework for 3D scene understanding when labeled scenes are scarce.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
Experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning.
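The core of such feature-aligned pre-training is a loss that pulls 3D features toward their paired vision-language embeddings; a generic cosine-distance version (an assumption, not the paper's exact hierarchical formulation) is:

```python
import torch.nn.functional as F

def feature_alignment_loss(feats_3d, feats_vl):
    """Pull 3D point/region features toward paired vision-language features."""
    f3 = F.normalize(feats_3d, dim=-1)
    fv = F.normalize(feats_vl, dim=-1)
    return (1 - (f3 * fv).sum(dim=-1)).mean()  # mean cosine distance
```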
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- EMERSK -- Explainable Multimodal Emotion Recognition with Situational Knowledge [0.0]
We present Explainable Multimodal Emotion Recognition with Situational Knowledge (EMERSK)
EMERSK is a general system for human emotion recognition and explanation using visual information.
Our system can handle multiple modalities, including facial expressions, posture, and gait in a flexible and modular manner.
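A flexible, modular design of this kind can be sketched as a dictionary of per-modality encoders whose outputs are pooled, tolerating absent modalities; the modality names and pooling choice below are hypothetical:

```python
import torch
import torch.nn as nn

class ModularFusion(nn.Module):
    """Average available modality embeddings; missing modalities are skipped."""
    def __init__(self, encoders: dict, embed_dim: int, num_classes: int):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)  # e.g. {"face": ..., "gait": ...}
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, inputs: dict):
        embs = [enc(inputs[name]) for name, enc in self.encoders.items()
                if name in inputs]               # tolerate absent modalities
        return self.head(torch.stack(embs).mean(dim=0))
```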
arXiv Detail & Related papers (2023-06-14T17:52:37Z)
- An audiovisual and contextual approach for categorical and continuous emotion recognition in-the-wild [27.943550651941166]
We tackle the task of video-based audio-visual emotion recognition, within the premises of the 2nd Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW).
Standard methodologies that rely solely on the extraction of facial features often fall short of accurate emotion prediction in cases where the aforementioned source of affective information is inaccessible due to head/body orientation, low resolution and poor illumination.
We aspire to alleviate this problem by leveraging bodily as well as contextual features, as part of a broader emotion recognition framework.
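Categorical and continuous recognition are commonly handled with two heads over a shared audiovisual embedding; the sketch below assumes generic feature inputs and valence/arousal bounded in [-1, 1], and is not the authors' released model:

```python
import torch
import torch.nn as nn

class AVEmotionHeads(nn.Module):
    """Shared audiovisual embedding with a categorical and a continuous head."""
    def __init__(self, audio_dim, visual_dim, num_classes, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, hidden), nn.ReLU())
        self.cls = nn.Linear(hidden, num_classes)   # categorical emotions
        self.va = nn.Linear(hidden, 2)              # valence/arousal regression

    def forward(self, audio_feat, visual_feat):
        h = self.fuse(torch.cat([audio_feat, visual_feat], dim=-1))
        return self.cls(h), torch.tanh(self.va(h))  # VA bounded in [-1, 1]
```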
arXiv Detail & Related papers (2021-07-07T20:13:17Z)
- Adaptive Intermediate Representations for Video Understanding [50.64187463941215]
We introduce a new way to leverage semantic segmentation as an intermediate representation for video understanding.
We propose a general framework which learns the intermediate representations (optical flow and semantic segmentation) jointly with the final video understanding task.
We obtain more powerful visual representations for videos which lead to performance gains over the state-of-the-art.
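Structurally, joint learning of intermediate representations amounts to a shared backbone with auxiliary heads whose losses are added to the task loss; a schematic version follows (head modules and loss weights are placeholders):

```python
import torch.nn as nn

class JointModel(nn.Module):
    """Shared backbone; flow and segmentation heads train jointly with the task."""
    def __init__(self, backbone, flow_head, seg_head, task_head):
        super().__init__()
        self.backbone, self.flow_head = backbone, flow_head
        self.seg_head, self.task_head = seg_head, task_head

    def forward(self, frames):
        feats = self.backbone(frames)
        return self.task_head(feats), self.flow_head(feats), self.seg_head(feats)

def joint_loss(task_l, flow_l, seg_l, lam_flow=0.1, lam_seg=0.1):
    # auxiliary losses shape the shared features used by the end task
    return task_l + lam_flow * flow_l + lam_seg * seg_l
```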
arXiv Detail & Related papers (2021-04-14T21:37:23Z)
- Neural Networks for Semantic Gaze Analysis in XR Settings [0.0]
We present a novel approach which minimizes the time and information necessary to annotate volumes of interest.
We train convolutional neural networks (CNNs) on synthetic data sets derived from virtual models using image augmentation techniques.
We evaluate our method in real and virtual environments, showing that the method can compete with state-of-the-art approaches.
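The image-augmentation side of training CNNs on synthetic renders can be as simple as a torchvision pipeline; the specific transforms and magnitudes below are illustrative assumptions, not the paper's configuration:

```python
from torchvision import transforms

# Augmentations applied to renders of virtual models before CNN training.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomHorizontalFlip(),
    transforms.GaussianBlur(kernel_size=5),
    transforms.ToTensor(),
])
```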
arXiv Detail & Related papers (2021-03-18T18:05:01Z)
- Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
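The integration idea can be caricatured as message passing between a visual node and a kinematics node; this is a deliberately simplified stand-in, not MRG-Net itself:

```python
import torch
import torch.nn as nn

class VisKinFusion(nn.Module):
    """One round of message passing between visual and kinematics embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.v2k = nn.Linear(dim, dim)   # visual -> kinematics message
        self.k2v = nn.Linear(dim, dim)   # kinematics -> visual message

    def forward(self, v, k):             # (B, dim) each
        v_new = torch.relu(v + self.k2v(k))
        k_new = torch.relu(k + self.v2k(v))
        return v_new, k_new
```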
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
- Video-based Facial Expression Recognition using Graph Convolutional Networks [57.980827038988735]
We introduce a Graph Convolutional Network (GCN) layer into a common CNN-RNN based model for video-based facial expression recognition.
We evaluate our method on three widely-used datasets, CK+, Oulu-CASIA and MMI, as well as the challenging in-the-wild dataset AFEW 8.0.
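The added GCN layer can be understood as one graph-convolution step over per-frame CNN features before the RNN; a bare-bones version, where the normalized adjacency a_hat is assumed given, is:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution step, X' = ReLU(A_hat X W), over frame nodes."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, a_hat):         # x: (B, T, in_dim), a_hat: (T, T)
        return torch.relu(self.w(a_hat @ x))
```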
arXiv Detail & Related papers (2020-10-26T07:31:51Z)
- Complex Human Action Recognition in Live Videos Using Hybrid FR-DL Method [1.027974860479791]
We address the challenges of the preprocessing phase by automatically selecting representative frames from the input sequences.
We propose a hybrid technique using background subtraction and HOG, followed by application of a deep neural network and skeletal modelling method.
We name our model the Feature Reduction & Deep Learning based action recognition method, or FR-DL for short.
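With OpenCV, the background-subtraction-plus-HOG front end they describe can be approximated as below; the parameter values and fixed HOG window are assumptions, not the paper's configuration:

```python
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)
hog = cv2.HOGDescriptor()                 # default 64x128 detection window

def frame_features(frame):
    mask = subtractor.apply(frame)        # foreground mask for this frame
    fg = cv2.bitwise_and(frame, frame, mask=mask)
    window = cv2.resize(fg, (64, 128))    # match the HOG window size
    return hog.compute(window)            # flattened HOG descriptor
```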
arXiv Detail & Related papers (2020-07-06T15:12:50Z)