EmotiCon: Context-Aware Multimodal Emotion Recognition using Frege's
Principle
- URL: http://arxiv.org/abs/2003.06692v1
- Date: Sat, 14 Mar 2020 19:55:21 GMT
- Title: EmotiCon: Context-Aware Multimodal Emotion Recognition using Frege's
Principle
- Authors: Trisha Mittal, Pooja Guhan, Uttaran Bhattacharya, Rohan Chandra,
Aniket Bera and Dinesh Manocha
- Abstract summary: We present EmotiCon, a learning-based algorithm for context-aware perceived human emotion recognition from videos and images.
Motivated by Frege's Context Principle from psychology, our approach combines three interpretations of context for emotion recognition.
We report an Average Precision (AP) score of 35.48 across 26 classes, which is an improvement of 7-8 AP points over prior methods.
- Score: 71.47160118286226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present EmotiCon, a learning-based algorithm for context-aware perceived
human emotion recognition from videos and images. Motivated by Frege's Context
Principle from psychology, our approach combines three interpretations of
context for emotion recognition. Our first interpretation is based on using
multiple modalities (e.g., faces and gaits) for emotion recognition. For the
second interpretation, we gather semantic context from the input image and use
a self-attention-based CNN to encode this information. Finally, we use depth
maps to model the third interpretation related to socio-dynamic interactions
and proximity among agents. We demonstrate the efficiency of our network
through experiments on EMOTIC, a benchmark dataset. We report an Average
Precision (AP) score of 35.48 across 26 classes, which is an improvement of
7-8 AP points over prior methods. We also introduce a new dataset, GroupWalk, which is a
collection of videos captured in multiple real-world settings of people
walking. We report an AP of 65.83 across 4 categories on GroupWalk, which is
also an improvement over prior methods.
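The three-context combination described in the abstract can be pictured as late fusion: each context stream (multimodal face/gait features, self-attention-encoded semantic features, depth-based proximity features) is mapped to per-class logits, the logits are summed, and a sigmoid yields independent probabilities for the 26 EMOTIC labels. The sketch below is an illustrative assumption, not the paper's actual architecture; feature sizes, the linear heads, and the fusion rule are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-context feature vectors (names and sizes are illustrative):
#   f_multi - fused face + gait features            (context 1)
#   f_sem   - self-attention-encoded scene features (context 2)
#   f_depth - depth-map proximity features          (context 3)
f_multi = rng.standard_normal(64)
f_sem   = rng.standard_normal(64)
f_depth = rng.standard_normal(64)

N_CLASSES = 26  # EMOTIC's 26 discrete emotion categories

def head(features, n_out, seed):
    """Stand-in linear head mapping one context stream to class logits."""
    w = np.random.default_rng(seed).standard_normal((n_out, features.size)) * 0.1
    return w @ features

# Late fusion: sum per-context logits, then apply a sigmoid so each of the
# 26 labels gets an independent probability (a multi-label setup, matching
# how AP is evaluated on EMOTIC).
logits = head(f_multi, N_CLASSES, 1) + head(f_sem, N_CLASSES, 2) + head(f_depth, N_CLASSES, 3)
probs = 1.0 / (1.0 + np.exp(-logits))

print(probs.shape)  # (26,)
```

Summing logits before the sigmoid is only one plausible fusion choice; concatenating features before a shared head would be an equally simple variant.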
Related papers
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
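The two-stage pipeline in this summary (a VLLM generates a description, then the description plus the image feeds a classifier) might be sketched as below. Both stages are stubbed with toy logic: the captions, image IDs, and keyword rules are hypothetical placeholders for the actual VLLM call and trained transformer.

```python
def describe_emotion(image_id: str) -> str:
    """Stage 1 (stub): a VLLM would generate a natural-language description
    of the subject's apparent emotion; here we fake one per image."""
    fake_captions = {
        "img_001": "The person is smiling broadly and leaning forward.",
        "img_002": "The person's shoulders are slumped and gaze is down.",
    }
    return fake_captions.get(image_id, "No description available.")

def classify(image_id: str, description: str) -> str:
    """Stage 2 (stub): a transformer would fuse the image with the
    description; a keyword rule stands in for the learned classifier."""
    if "smiling" in description:
        return "happy"
    if "slumped" in description:
        return "sad"
    return "neutral"

for img in ("img_001", "img_002"):
    desc = describe_emotion(img)    # stage 1: caption the apparent emotion
    label = classify(img, desc)     # stage 2: caption + image -> label
    print(img, label)
```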
arXiv Detail & Related papers (2024-04-10T15:09:15Z) - GiMeFive: Towards Interpretable Facial Emotion Classification [1.1468563069298348]
Deep convolutional neural networks have been shown to successfully recognize facial emotions.
We propose our model GiMeFive with interpretations, i.e., via layer activations and gradient-weighted class mapping.
Empirical results show that our model outperforms the previous methods in terms of accuracy.
arXiv Detail & Related papers (2024-02-24T00:37:37Z) - Borrowing Human Senses: Comment-Aware Self-Training for Social Media
Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved via jointly leveraging visual and lingual similarity.
The classification tasks are explored via self-training in a teacher-student framework, motivated by the usually limited labeled data scales.
The results show that our method further advances the performance of previous state-of-the-art models.
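The teacher-student self-training loop this summary mentions can be illustrated with a deliberately tiny stand-in: a teacher is fit on the small labeled set, pseudo-labels an unlabeled pool, and a student is then fit on both. The 1-D threshold "classifier" and the data are invented for the sketch and bear no relation to the paper's actual models.

```python
# (feature, label) pairs; the usual situation is few labeled, many unlabeled.
labeled   = [(0.9, 1), (0.8, 1), (0.2, 0), (0.1, 0)]
unlabeled = [0.85, 0.15, 0.6]

def fit(data):
    """Toy 'training': learn a 1-D threshold at the midpoint of class means."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(threshold, x):
    return 1 if x >= threshold else 0

# Teacher: trained on the small labeled set only.
teacher_t = fit(labeled)

# Pseudo-label the unlabeled pool with the teacher's predictions.
pseudo = [(x, predict(teacher_t, x)) for x in unlabeled]

# Student: trained on labeled + pseudo-labeled data together.
student_t = fit(labeled + pseudo)

print(round(teacher_t, 3), round(student_t, 3))
```

Real systems usually keep only high-confidence pseudo-labels and iterate; this sketch runs a single round for clarity.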
arXiv Detail & Related papers (2023-03-27T08:59:55Z) - Multimodal Feature Extraction and Attention-based Fusion for Emotion
Estimation in Videos [16.28109151595872]
We introduce our submission to the CVPR 2023 Competition on Affective Behavior Analysis in-the-wild (ABAW)
We exploited multimodal features extracted from video of different lengths from the competition dataset, including audio, pose and images.
Our system achieves the performance of 0.361 on the validation dataset.
arXiv Detail & Related papers (2023-03-18T14:08:06Z) - HighlightMe: Detecting Highlights from Human-Centric Videos [52.84233165201391]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z) - Multimodal Emotion Recognition with High-level Speech and Text Features [8.141157362639182]
We propose a novel cross-representation speech model to perform emotion recognition on wav2vec 2.0 speech features.
We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem.
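One simple way to combine a speech branch and a text branch, as this summary describes, is score-level fusion: each modality's classifier produces class probabilities that are then averaged. The sketch below assumes pre-extracted embeddings (stand-ins for wav2vec 2.0 speech features and Transformer text features); the sizes, heads, and fusion rule are illustrative, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
CLASSES = ["angry", "happy", "neutral", "sad"]  # 4-class IEMOCAP setup

# Hypothetical pre-extracted embeddings (sizes are illustrative).
speech_feat = rng.standard_normal(768)
text_feat   = rng.standard_normal(768)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def branch_logits(feat, seed):
    """Stand-in for a trained per-modality classifier head."""
    w = np.random.default_rng(seed).standard_normal((len(CLASSES), feat.size)) * 0.05
    return w @ feat

# Score-level fusion: average the per-modality class probabilities.
p_speech = softmax(branch_logits(speech_feat, 1))
p_text   = softmax(branch_logits(text_feat, 2))
p_fused  = (p_speech + p_text) / 2

print(CLASSES[int(np.argmax(p_fused))])
```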
arXiv Detail & Related papers (2021-09-29T07:08:40Z) - Affect2MM: Affective Analysis of Multimedia Content Using Emotion
Causality [84.69595956853908]
We present Affect2MM, a learning method for time-series emotion prediction for multimedia content.
Our goal is to automatically capture the varying emotions depicted by characters in real-life human-centric situations and behaviors.
arXiv Detail & Related papers (2021-03-11T09:07:25Z) - Context Based Emotion Recognition using EMOTIC Dataset [22.631542327834595]
We present EMOTIC, a dataset of images of people annotated with their apparent emotion.
Using the EMOTIC dataset we train different CNN models for emotion recognition.
Our results show how scene context provides important information to automatically recognize emotional states.
arXiv Detail & Related papers (2020-03-30T12:38:50Z) - Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using
Affective Cues [75.1731999380562]
We present a learning-based method for detecting real and fake deepfake multimedia content.
We extract and analyze the similarity between the audio and visual modalities from within the same video.
We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z) - Take an Emotion Walk: Perceiving Emotions from Gaits Using Hierarchical Attention Pooling and Affective Mapping [55.72376663488104]
We present an autoencoder-based approach to classify perceived human emotions from walking styles obtained from videos or motion-captured data.
Given the motion on each joint in the pose at each time step extracted from 3D pose sequences, we hierarchically pool these joint motions in the encoder.
We train the decoder to reconstruct the motions per joint per time step in a top-down manner from the latent embeddings.
arXiv Detail & Related papers (2019-11-20T05:04:16Z)
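The encoder side of this gait autoencoder (per-joint motions pooled hierarchically into a latent embedding) can be sketched as below. The pairwise-averaging "hierarchy" is a stand-in for pooling along the actual kinematic tree, and all sizes are illustrative; the decoder that reconstructs motions from the latent is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

T, J = 30, 16                           # time steps, joints (illustrative)
poses = rng.standard_normal((T, J, 3))  # 3D pose sequence

# Per-joint motion: frame-to-frame displacement of each joint.
motion = np.diff(poses, axis=0)         # shape (T-1, J, 3)

def pool_pairs(x):
    """Pool joints in adjacent pairs: (T-1, J, 3) -> (T-1, J//2, 3).
    A stand-in for pooling up the skeleton's kinematic tree."""
    return 0.5 * (x[:, 0::2, :] + x[:, 1::2, :])

# Hierarchically pool joint motions until a single channel remains,
# then average over time to get one latent embedding. (A decoder would
# reconstruct the per-joint, per-step motions from this latent.)
level = motion
while level.shape[1] > 1:
    level = pool_pairs(level)
latent = level.mean(axis=0).ravel()

print(latent.shape)  # (3,)
```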
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.